ID: 2006.11527

Memory Transformer

June 20, 2020

Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov
Computer Science
Computation and Language
Machine Learning
Neural and Evolutionary Computing

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows a transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make processing properties related to the sequence as a whole more difficult. Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction to improve the Transformer model. Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations. MANNs have demonstrated the capability to learn simple algorithms like Copy or Reverse and can be successfully trained via backpropagation on diverse tasks, from question answering to language modeling, outperforming RNNs and LSTMs of comparable complexity. In this work, we propose and study a few extensions of the Transformer baseline: (1) adding memory tokens to store non-local representations, (2) creating a memory bottleneck for the global information, and (3) controlling memory updates with a dedicated layer. We evaluate these memory-augmented Transformers and demonstrate that the presence of memory correlates positively with model performance on machine translation and language modeling tasks. Augmenting a pre-trained masked language model with memory tokens shows mixed results on tasks from the GLUE benchmark. Visualization of attention patterns over the memory suggests that it improves the model's ability to process a global context.
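The first extension, memory tokens, amounts to prepending a small set of trainable vectors to the input sequence so that self-attention can read from and write to them at every layer. The PyTorch sketch below illustrates this idea only; the class name MemTransformerEncoder and all sizes are illustrative assumptions, not the authors' implementation, and positional encodings are omitted for brevity.

# Minimal sketch of the "memory tokens" variant: trainable vectors are
# prepended to the token embeddings, so attention in every layer can use them
# as non-local storage. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class MemTransformerEncoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, nhead=8,
                 num_layers=6, num_mem_tokens=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # [mem] tokens: learned parameters shared across all inputs
        self.memory = nn.Parameter(torch.randn(num_mem_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.num_mem_tokens = num_mem_tokens

    def forward(self, input_ids):
        b = input_ids.size(0)
        x = self.embed(input_ids)                          # (B, L, d)
        mem = self.memory.unsqueeze(0).expand(b, -1, -1)   # (B, M, d)
        h = self.encoder(torch.cat([mem, x], dim=1))       # (B, M + L, d)
        # split the updated memory from the token representations
        return h[:, :self.num_mem_tokens], h[:, self.num_mem_tokens:]

model = MemTransformerEncoder()
mem_out, tok_out = model(torch.randint(0, 32000, (2, 16)))
print(mem_out.shape, tok_out.shape)   # (2, 10, 512) and (2, 16, 512)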

Similar papers

Recurrent Memory Transformer

July 14, 2022

94% Match
Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Computation and Language
Machine Learning

Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows the model to combine information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level...
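The snippet below sketches what such segment-level recurrence could look like: a long input is split into segments, each segment is processed together with a set of memory tokens, and the memory written for one segment is passed on to the next. The SegmentStep module and all sizes are illustrative assumptions, not the paper's implementation.

# Sketch of segment-level recurrent memory: memory produced by one segment
# becomes the input memory of the next, so information can propagate far
# beyond a single segment. Gradients flow through the carried memory.
import torch
import torch.nn as nn

class SegmentStep(nn.Module):
    """Process one segment plus M memory tokens with a shared Transformer block."""
    def __init__(self, d_model=256, nhead=4, num_mem_tokens=8):
        super().__init__()
        self.init_mem = nn.Parameter(torch.randn(num_mem_tokens, d_model) * 0.02)
        block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=2)
        self.m = num_mem_tokens

    def forward(self, seg_emb, memory=None):
        if memory is None:                                   # first segment
            memory = self.init_mem.unsqueeze(0).expand(seg_emb.size(0), -1, -1)
        h = self.encoder(torch.cat([memory, seg_emb], dim=1))
        return h[:, :self.m], h[:, self.m:]                  # new memory, outputs

step = SegmentStep()
long_input = torch.randn(2, 1024, 256)                       # already-embedded tokens
memory, outs = None, []
for seg in long_input.split(128, dim=1):                     # 8 segments of length 128
    memory, h = step(seg, memory)
    outs.append(h)
print(torch.cat(outs, dim=1).shape)                          # (2, 1024, 256)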


GMAT: Global Memory Augmentation for Transformers

June 5, 2020

92% Match
Ankit Gupta, Jonathan Berant
Machine Learning
Computation and Language
Machine Learning

Transformer-based models have become ubiquitous in natural language processing thanks to their large capacity, innate parallelism and high performance. The contextualizing component of a Transformer block is the $\textit{pairwise dot-product}$ attention that has a large $\Omega(L^2)$ memory requirement for length $L$ sequences, limiting its ability to process long documents. This has been the subject of substantial interest recently, where multiple approximations were propose...
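The quoted $\Omega(L^2)$ cost comes from materializing one attention score per query-key pair. The rough back-of-the-envelope sketch below, with an assumed local window w and a small global memory of M tokens, shows how routing long-range interaction through memory changes the scaling; the window and memory sizes are illustrative, not figures from the paper.

# Illustration of attention-score counts: full attention vs. a local window
# plus a small global memory. Sizes are arbitrary assumptions.
def full_attention_scores(L, heads=12):
    return heads * L * L                     # one score per (query, key) pair per head

def windowed_plus_memory_scores(L, w=128, M=64, heads=12):
    # sequence tokens attend to a local window of w tokens plus M memory tokens;
    # memory tokens attend to the whole sequence: O(L*(w + M) + M*L)
    return heads * (L * (w + M) + M * L)

for L in (512, 4096, 16384):
    print(L, full_attention_scores(L), windowed_plus_memory_scores(L))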


Memformer: A Memory-Augmented Transformer for Sequence Modeling

October 14, 2020

91% Match
Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, ... , Zhou Yu
Computation and Language

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network for sequence modeling, that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new opt...
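A fixed-size external memory of this kind can be sketched as an attention-based read plus a gated write, which keeps the carried state constant in size no matter how many segments have been processed. The DynamicMemory module below is an illustrative minimal version; Memformer's exact read and write rules differ.

# Sketch of a fixed-size dynamic memory: tokens read from memory slots via
# attention, and slots are updated with a gated write after each segment.
import torch
import torch.nn as nn

class DynamicMemory(nn.Module):
    def __init__(self, slots=32, d_model=256, nhead=4):
        super().__init__()
        self.init = nn.Parameter(torch.randn(slots, d_model) * 0.02)
        self.read = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.write = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def reset(self, batch_size):
        return self.init.unsqueeze(0).expand(batch_size, -1, -1)

    def read_memory(self, hidden, memory):
        # tokens attend over memory slots to retrieve past information
        out, _ = self.read(hidden, memory, memory)
        return hidden + out

    def write_memory(self, hidden, memory):
        # each slot attends over the current segment; a sigmoid gate decides
        # how much of the slot to overwrite
        cand, _ = self.write(memory, hidden, hidden)
        g = torch.sigmoid(self.gate(torch.cat([memory, cand], dim=-1)))
        return (1 - g) * memory + g * cand

mem_mod = DynamicMemory()
memory = mem_mod.reset(batch_size=2)
for segment in torch.randn(4, 2, 64, 256):        # 4 segments of 64 hidden states
    h = mem_mod.read_memory(segment, memory)      # use memory as extra context
    memory = mem_mod.write_memory(h, memory)      # constant-size state carried on
print(memory.shape)                               # (2, 32, 256)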

Scaling Transformer to 1M tokens and beyond with RMT

April 19, 2023

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Computation and Language
Artificial Intelligence
Machine Learning

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and glob...

MEMORYLLM: Towards Self-Updatable Large Language Models

February 7, 2024

90% Match
Yu Wang, Xiusi Chen, ... , Julian McAuley
Computation and Language

Existing Large Language Models (LLMs) usually remain static after deployment, which might make it hard to inject new knowledge into the model. We aim to build models containing a considerable portion of self-updatable parameters, enabling the model to integrate new knowledge effectively and efficiently. To this end, we introduce MEMORYLLM, a model that comprises a transformer and a fixed-size memory pool within the latent space of the transformer. MEMORYLLM can self-update wi...


Augmenting Self-attention with Persistent Memory

July 2, 2019

90% Match
Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, ... , Armand Joulin
Machine Learning
Computation and Language
Machine Learning

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention laye...
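The persistent vectors described here are learned key/value pairs that do not depend on the input and are concatenated to the keys and values computed from the tokens, letting a single attention operation also play the role of the feed-forward sublayer. The single-head sketch below is illustrative; the class name and dimensions are assumptions.

# Sketch of self-attention augmented with learned persistent key/value vectors.
import math
import torch
import torch.nn as nn

class PersistentMemoryAttention(nn.Module):
    def __init__(self, d_model=256, n_persistent=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # learned vectors that do not depend on the input
        self.pk = nn.Parameter(torch.randn(n_persistent, d_model) * 0.02)
        self.pv = nn.Parameter(torch.randn(n_persistent, d_model) * 0.02)
        self.d = d_model

    def forward(self, x):                                   # x: (B, L, d)
        b = x.size(0)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # append persistent keys/values to the input-dependent ones
        k = torch.cat([k, self.pk.unsqueeze(0).expand(b, -1, -1)], dim=1)
        v = torch.cat([v, self.pv.unsqueeze(0).expand(b, -1, -1)], dim=1)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)
        return attn @ v                                     # (B, L, d)

print(PersistentMemoryAttention()(torch.randn(2, 10, 256)).shape)   # (2, 10, 256)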


Transformer with Memory Replay

May 19, 2022

90% Match
Rui Liu, Barzan Mozafari
Machine Learning

Transformers achieve state-of-the-art performance for natural language processing tasks by pre-training on large-scale text corpora. They are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism that remembers and reuses past examples by saving to and replaying from a memory buffer. It has been successfully used in reinforcement learning and GANs due to better sample efficiency. In this paper, we propose \emph{Transformer with Memory ...
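Memory replay in this sense is a training-time mechanism rather than an architectural change: past examples are stored in a bounded buffer and mixed back into later updates. The sketch below uses a uniform sampling policy and arbitrary sizes purely for illustration; the paper's actual replay strategy may differ.

# Minimal replay-buffer sketch: save seen examples, mix a few of them back
# into each fresh training batch.
import random

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.buffer = []

    def add(self, examples):
        self.buffer.extend(examples)
        if len(self.buffer) > self.capacity:            # drop the oldest examples
            self.buffer = self.buffer[-self.capacity:]

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

def train_step(model_update, fresh_batch, buffer, replay_size=8):
    batch = list(fresh_batch) + buffer.sample(replay_size)   # fresh + replayed
    model_update(batch)                                      # one optimizer step
    buffer.add(fresh_batch)                                  # remember for later

buffer = ReplayBuffer()
for step in range(3):
    fresh = [f"example_{step}_{i}" for i in range(32)]
    train_step(lambda batch: None, fresh, buffer)            # dummy update
print(len(buffer.buffer))                                    # 96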


Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling

September 15, 2022

90% Match
Qingyang Wu, Zhou Yu
Computation and Language

Transformer encoder-decoder models have achieved great performance in dialogue generation tasks; however, their inability to process long dialogue history often leads to truncation of the context. To address this problem, we propose a novel memory-augmented transformer that is compatible with existing pre-trained encoder-decoder models and enables efficient preservation of the dialogue history information. By incorporating a separate memory module alongside the pre-trained tra...


LaMemo: Language Modeling with Look-Ahead Memory

April 15, 2022

90% Match
Haozhe Ji, Rongsheng Zhang, Zhenyu Yang, ... , Minlie Huang
Computation and Language

Although Transformers with fully connected self-attention are powerful at modeling long-term dependencies, they struggle to scale to long texts with thousands of words in language modeling. One of the solutions is to equip the model with a recurrence memory. However, existing approaches directly reuse hidden states from the previous segment that encode contexts in a uni-directional way. As a result, this prohibits the memory from dynamically interacting with the current conte...


Birth of a Transformer: A Memory Viewpoint

June 1, 2023

89% Match
Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, ... , Leon Bottou
Machine Learning
Computation and Language
Machine Learning

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthe...
