ID: 2304.11062

Scaling Transformer to 1M tokens and beyond with RMT

April 19, 2023

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Computer Science
Computation and Language
Artificial Intelligence
Machine Learning

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and global information and enables information flow between segments of the input sequence through the use of recurrence. Our experiments demonstrate the effectiveness of our approach, which holds significant potential to enhance long-term dependency handling in natural language understanding and generation tasks as well as enable large-scale context processing for memory-intensive applications.
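
A minimal sketch of the segment-level recurrence with memory tokens described above, assuming a generic encoder backbone; this is an illustration, not the authors' implementation, and names such as `RecurrentMemorySketch`, `backbone`, and `mem_size` are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    """Split a long input into segments and carry memory tokens between them."""

    def __init__(self, backbone: nn.Module, hidden: int, mem_size: int = 10):
        super().__init__()
        self.backbone = backbone                 # any (B, T, H) -> (B, T, H) encoder
        self.memory = nn.Parameter(torch.randn(mem_size, hidden) * 0.02)
        self.mem_size = mem_size

    def forward(self, segments):                 # segments: list of (B, T, H) tensors
        mem = self.memory.unsqueeze(0).expand(segments[0].size(0), -1, -1)
        outputs = []
        for seg in segments:
            # Prepend the current memory state, run the backbone, then carry
            # the updated memory tokens forward to the next segment.
            y = self.backbone(torch.cat([mem, seg], dim=1))
            mem, out = y[:, :self.mem_size], y[:, self.mem_size:]
            outputs.append(out)
        return outputs, mem

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = RecurrentMemorySketch(nn.TransformerEncoder(layer, num_layers=2), hidden=64)
segments = list(torch.randn(4, 2, 128, 64))      # 4 segments of 128 token embeddings
outs, final_mem = model(segments)
print(final_mem.shape)                           # torch.Size([2, 10, 64])
```

The effective context grows with the number of segments, while each attention pass only covers one segment plus the memory tokens.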

Similar papers

95% Match
Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Computation and Language
Machine Learning

Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level...
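
To make the quadratic-complexity point concrete, here is a rough back-of-the-envelope count of how many attention scores a single full pass would materialize versus per-segment processing; the head count and segment length are illustrative assumptions, not figures from the paper.

```python
# One L x L score matrix per attention head.
def attention_score_values(seq_len: int, heads: int = 12) -> int:
    return heads * seq_len * seq_len

full = attention_score_values(2_000_000)                        # single full-attention pass
segmented = attention_score_values(512) * (2_000_000 // 512)    # segment-level processing
print(f"full: {full:.3e} values, segmented: {segmented:.3e} values")
# full: 4.800e+13 values, segmented: 1.229e+10 values
```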

Ivan Rodkin, Yuri Kuratov, ... , Mikhail Burtsev
Computation and Language
Artificial Intelligence
Machine Learning

This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tas...
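
The constant-time-per-step property can be illustrated with a generic outer-product associative memory, a textbook construction rather than the ARMT update rule; the class name and dimensions below are assumptions.

```python
import torch

class AssociativeMemory:
    """Fixed-size key-value store: write and read cost O(d^2) per item,
    independent of how many associations have already been stored."""

    def __init__(self, dim: int):
        self.A = torch.zeros(dim, dim)            # association matrix never grows

    def write(self, key: torch.Tensor, value: torch.Tensor):
        self.A += torch.outer(value, key)         # bind value to key

    def read(self, key: torch.Tensor) -> torch.Tensor:
        return self.A @ key                       # retrieve the bound value

mem = AssociativeMemory(dim=64)
key, value = torch.randn(64), torch.randn(64)
key = key / key.norm()                            # unit-norm key for clean retrieval
mem.write(key, value)
print(torch.allclose(mem.read(key), value, atol=1e-4))   # True
```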

Zifan He, Zongyue Qin, Neha Prakriya, ... , Jason Cong
Computation and Language
Machine Learning

Transformer-based large language models (LLM) have been widely used in language processing applications. However, most of them restrict the context window within which the model can attend to every token of the input. Previous works on recurrent models can memorize past tokens to enable unlimited context and maintain effectiveness. However, they have "flat" memory architectures, which have limitations in selecting and filtering information. Since humans are good at learning a...
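
As a rough illustration of what selecting and filtering could mean in contrast to a flat memory, the sketch below performs a two-level read: pick the most relevant memory chunk by its summary, then attend only within that chunk. This is a generic construction, not HMT's actual architecture; all names and sizes are assumptions.

```python
import torch

def hierarchical_read(query, chunk_summaries, chunk_contents):
    # query: (H,); chunk_summaries: (C, H); chunk_contents: (C, M, H)
    best = (chunk_summaries @ query).argmax()      # level 1: select one chunk
    chunk = chunk_contents[best]                   # (M, H)
    weights = torch.softmax(chunk @ query / query.size(0) ** 0.5, dim=0)
    return weights @ chunk                         # level 2: attend within it

C, M, H = 16, 32, 64
summaries, contents = torch.randn(C, H), torch.randn(C, M, H)
print(hierarchical_read(torch.randn(H), summaries, contents).shape)  # torch.Size([64])
```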

Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, ... , Zhou Yu
Computation and Language

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network for sequence modeling, that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new opt...
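
A rough sketch of reading from and writing to a fixed-size external memory with attention, showing why memory space stays constant as the sequence grows; the read and write rules are generic assumptions, not Memformer's actual equations.

```python
import torch

def read_memory(query, memory):
    # query: (B, H), memory: (B, S, H) -> attention-weighted summary (B, H)
    scores = torch.einsum('bh,bsh->bs', query, memory) / memory.size(-1) ** 0.5
    return torch.einsum('bs,bsh->bh', torch.softmax(scores, dim=-1), memory)

def write_memory(summary, memory):
    # Each slot is updated in proportion to its similarity to the new summary,
    # so the slot count never grows while the contents change over time.
    scores = torch.einsum('bh,bsh->bs', summary, memory) / memory.size(-1) ** 0.5
    w = torch.softmax(scores, dim=-1).unsqueeze(-1)        # (B, S, 1)
    return (1.0 - w) * memory + w * summary.unsqueeze(1)

B, S, H = 2, 8, 64
memory = torch.zeros(B, S, H)
for segment_summary in torch.randn(5, B, H):   # five segment summaries in order
    _ = read_memory(segment_summary, memory)   # retrieve past information
    memory = write_memory(segment_summary, memory)
print(memory.shape)                            # torch.Size([2, 8, 64]) -- fixed size
```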

Ankit Gupta, Jonathan Berant
Machine Learning
Computation and Language

Transformer-based models have become ubiquitous in natural language processing thanks to their large capacity, innate parallelism and high performance. The contextualizing component of a Transformer block is the $\textit{pairwise dot-product}$ attention that has a large $\Omega(L^2)$ memory requirement for length $L$ sequences, limiting its ability to process long documents. This has been the subject of substantial interest recently, where multiple approximations were propose...
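
The $\Omega(L^2)$ term comes from materializing the full $L \times L$ score matrix. One common workaround, sketched generically below rather than as this paper's specific method, is to process queries in chunks so that only a chunk-by-$L$ slice of the scores exists at any time.

```python
import torch

def chunked_attention(q, k, v, chunk: int = 256):
    # q, k, v: (L, H); only a (chunk, L) slice of scores is alive at once.
    out = []
    for i in range(0, q.size(0), chunk):
        scores = q[i:i + chunk] @ k.T / k.size(-1) ** 0.5
        out.append(torch.softmax(scores, dim=-1) @ v)
    return torch.cat(out, dim=0)

q = k = v = torch.randn(2048, 64)
reference = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v   # materializes 2048 x 2048
assert torch.allclose(chunked_attention(q, k, v), reference, atol=1e-4)
```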

Anna Rogers, Olga Kovaleva, Anna Rumshisky
Computation and Language

Transformer-based models have pushed the state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. W...

Davis Yoshida, Allyson Ettinger, Kevin Gimpel
Computation and Language

Fine-tuning a pretrained transformer for a downstream task has become a standard method in NLP in the last few years. While the results from these models are impressive, applying them can be extremely computationally expensive, as is pretraining new models with the latest architectures. We present a novel method for applying pretrained transformer language models which lowers their memory requirement both at training and inference time. An additional benefit is that our metho...

Memory Transformer

June 20, 2020

91% Match
Mikhail S. Burtsev, Yuri Kuratov, ... , Grigory V. Sapunov
Computation and Language
Machine Learning
Neural and Evolutionary Computing

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selective...
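
A minimal sketch of adding trainable memory tokens to the input of a standard encoder, as the abstract describes; the number of memory tokens, the layer sizes, and the class name are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MemoryAugmentedEncoder(nn.Module):
    def __init__(self, hidden: int = 256, num_mem: int = 10):
        super().__init__()
        self.mem = nn.Parameter(torch.randn(num_mem, hidden) * 0.02)   # trainable memory
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_embeddings):            # (B, T, hidden)
        b = token_embeddings.size(0)
        mem = self.mem.unsqueeze(0).expand(b, -1, -1)
        # Memory tokens attend to, and are attended by, every sequence element,
        # giving the model dedicated slots for sequence-level information.
        return self.encoder(torch.cat([mem, token_embeddings], dim=1))

x = torch.randn(2, 32, 256)
print(MemoryAugmentedEncoder()(x).shape)            # torch.Size([2, 42, 256])
```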

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, ... , Mikhail Burtsev
Computation and Language
Artificial Intelligence
Machine Learning

This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory...
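
To illustrate the kind of task such a benchmark targets, here is one rough way a sample with facts distributed inside long distractor text could be assembled; this is a generic construction for illustration, not the BABILong generator.

```python
import random

def make_sample(facts, question, answer, filler_sentences, target_len_chars):
    # Pad with distractor sentences, then scatter the facts at random positions.
    body = []
    while sum(len(s) for s in body) < target_len_chars:
        body.append(random.choice(filler_sentences))
    for fact in facts:
        body.insert(random.randrange(len(body) + 1), fact)
    return {"context": " ".join(body), "question": question, "answer": answer}

sample = make_sample(
    facts=["Mary went to the kitchen.", "Mary picked up the apple."],
    question="Where is the apple?",
    answer="kitchen",
    filler_sentences=["The weather was unremarkable that day."],
    target_len_chars=2_000,
)
print(len(sample["context"]), sample["question"])
```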

91% Match
Phoebe Klett, Thomas Ahle
Machine Learning
Computation and Language

Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should ...
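
A sketch of retrieving from a bank of pre-computed (key, value) memories with a nearest-neighbour lookup, in the spirit of the Memorizing Transformers approach the abstract builds on; the bank layout, the softmax weighting, and the value of k are assumptions made for illustration.

```python
import torch

def knn_memory_lookup(queries, mem_keys, mem_values, k: int = 4):
    # queries: (Q, H); mem_keys, mem_values: (N, H)
    sims = queries @ mem_keys.T                      # (Q, N) similarity scores
    top_sims, idx = sims.topk(k, dim=-1)             # k best memories per query
    weights = torch.softmax(top_sims, dim=-1)        # (Q, k)
    retrieved = mem_values[idx]                      # (Q, k, H)
    return (weights.unsqueeze(-1) * retrieved).sum(dim=1)

bank_keys, bank_values = torch.randn(10_000, 64), torch.randn(10_000, 64)
queries = torch.randn(8, 64)
print(knn_memory_lookup(queries, bank_keys, bank_values).shape)   # torch.Size([8, 64])
```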