ID: 2010.06891

Memformer: A Memory-Augmented Transformer for Sequence Modeling

October 14, 2020

Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, Zhou Yu
Computer Science
Computation and Language

Transformers have achieved remarkable success in sequence modeling. However, these models have efficiency issues because they need to store all past token-level representations as memory. We present Memformer, an efficient neural network for sequence modeling that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new optimization scheme, memory replay back-propagation (MRBP), which promotes long-range back-propagation through time with a significantly reduced memory requirement. Experimental results show that Memformer achieves performance comparable to the baselines while using 8.1x less memory space and running 3.2x faster at inference. Analysis of the attention patterns shows that our external memory slots can encode and retain important information across timesteps.
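
As a rough illustration of the mechanism the abstract describes (a fixed set of external memory slots that each segment reads from and writes back to), the following is a minimal, hypothetical sketch rather than the authors' architecture; the slot count, layer sizes, and module names are assumptions.

```python
# Minimal sketch of a memory-augmented segment step: an external fixed-size
# memory is read via cross-attention and then written back. Slot count,
# dimensions, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class MemorySegmentBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, num_slots=8):
        super().__init__()
        self.read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.write = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.init_memory = nn.Parameter(torch.randn(num_slots, d_model))

    def initial_memory(self, batch_size):
        return self.init_memory.unsqueeze(0).expand(batch_size, -1, -1)

    def forward(self, segment, memory):
        # Read: tokens query the external memory slots for past information.
        read_out, _ = self.read(segment, memory, memory)
        hidden = self.self_attn(segment + read_out)
        # Write: memory slots query the new hidden states and get updated,
        # so the memory footprint stays constant across timesteps.
        written, _ = self.write(memory, hidden, hidden)
        return hidden, memory + written

# Processing a long sequence segment by segment: cost grows linearly with the
# number of segments while the memory size stays fixed.
block = MemorySegmentBlock()
memory = block.initial_memory(batch_size=2)
for segment in torch.randn(10, 2, 16, 256):            # 10 segments of 16 tokens
    hidden, memory = block(segment, memory.detach())   # detach = plain truncation
```

The `detach()` call above is ordinary truncated back-propagation through time; it stands in for, and should not be confused with, the MRBP scheme the abstract proposes for long-range gradients at reduced memory cost.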

Similar papers

Scaling Transformer to 1M tokens and beyond with RMT

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Computation and Language
Artificial Intelligence
Machine Learning

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and glob...

Recurrent Memory Transformer

July 14, 2022

92% Match
Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Computation and Language
Machine Learning

Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level...
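
The segment-level recurrence described here can be pictured as follows: a fixed number of memory tokens is prepended to every segment, and their output states are carried over as the memory for the next segment. The sketch below is a simplified illustration under that reading, not the paper's exact model; shapes, layer counts, and the single shared memory are assumptions.

```python
# Hedged sketch of segment-level recurrence with memory tokens.
import torch
import torch.nn as nn

class SegmentRecurrentEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_mem_tokens=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mem_tokens = nn.Parameter(torch.randn(n_mem_tokens, d_model))
        self.n_mem = n_mem_tokens

    def forward(self, segments):
        # segments: (n_segments, batch, seg_len, d_model)
        batch = segments.size(1)
        memory = self.mem_tokens.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for seg in segments:
            x = torch.cat([memory, seg], dim=1)    # [mem] tokens + segment tokens
            h = self.encoder(x)
            memory = h[:, :self.n_mem]             # memory states carried forward
            outputs.append(h[:, self.n_mem:])      # token representations
        return torch.cat(outputs, dim=1), memory

model = SegmentRecurrentEncoder()
long_input = torch.randn(6, 2, 32, 256)            # 6 segments of 32 tokens each
token_states, final_memory = model(long_input)
```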


Transformer with Memory Replay

May 19, 2022

92% Match
Rui Liu, Barzan Mozafari
Machine Learning

Transformers achieve state-of-the-art performance for natural language processing tasks by pre-training on large-scale text corpora. They are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism that remembers and reuses past examples by saving to and replaying from a memory buffer. It has been successfully used in reinforcement learning and GANs due to better sample efficiency. In this paper, we propose \emph{Transformer with Memory ...
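
The replay mechanism described above can be illustrated with a plain buffer that stores past examples and mixes a replayed batch into periodic training steps. The buffer capacity, sampling policy, and replay schedule below are illustrative assumptions, not the paper's prescription.

```python
# Illustrative memory-replay loop: save past examples, periodically replay them.
import random
from collections import deque

import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Stores past training examples and serves random replay batches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, batch):
        xs, ys = batch
        for x, y in zip(xs, ys):                  # store examples individually
            self.buffer.append((x, y))

    def sample(self, batch_size):
        examples = random.sample(list(self.buffer), batch_size)
        xs, ys = zip(*examples)
        return torch.stack(xs), torch.stack(ys)

def train_step(model, optimizer, batch, buffer, step, replay_every=4):
    xs, ys = batch
    loss = F.cross_entropy(model(xs), ys)
    # Periodically add a replayed batch of past examples to the loss.
    if step % replay_every == 0 and len(buffer) >= xs.size(0):
        rx, ry = buffer.sample(xs.size(0))
        loss = loss + F.cross_entropy(model(rx), ry)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.add(batch)
    return loss.item()

model = torch.nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
buf = ReplayBuffer()
for step in range(100):
    batch = (torch.randn(8, 16), torch.randint(0, 2, (8,)))
    train_step(model, opt, batch, buf, step)
```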


Memory Transformer

June 20, 2020

91% Match
Mikhail S. Burtsev, Yuri Kuratov, ... , Grigory V. Sapunov
Computation and Language
Machine Learning
Neural and Evolutionary Computing

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selective...
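
Read literally, the mechanism here is a set of trainable, input-independent memory embeddings concatenated to the token sequence of a standard encoder, with no recurrence across segments (which distinguishes it from the recurrent sketch above). A minimal sketch under that assumption; sizes are arbitrary.

```python
# Trainable memory embeddings concatenated to the input of a plain encoder.
import torch
import torch.nn as nn

class TokenMemoryEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_mem=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.memory = nn.Parameter(torch.randn(n_mem, d_model))

    def forward(self, tokens):                     # tokens: (batch, seq, d_model)
        mem = self.memory.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([mem, tokens], dim=1))
        return out[:, self.memory.size(0):]        # drop the memory positions

out = TokenMemoryEncoder()(torch.randn(2, 64, 256))
```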


Linearizing Transformer with Key-Value Memory

March 23, 2022

91% Match
Yizhe Zhang, Deng Cai
Computation and Language
Machine Learning

Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based transformers. Despite their unique merits, they usually suffer a performance drop compared with the vanilla transformer on many sequence generation tasks, and often fail to obtain a computation gain when the generated sequence is short. We propose Me...
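
For context on how kernel-based variants reach linear complexity, the snippet below shows the standard associativity trick they rely on: with a feature map phi, attention is computed as phi(Q) @ (phi(K)^T V), which is linear rather than quadratic in sequence length. The elu+1 feature map is one common choice, used here purely to illustrate the family, not the method proposed in the paper (whose name is truncated above).

```python
# Kernel-based linear attention: O(seq * dim^2) instead of O(seq^2 * dim).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq, dim); v: (batch, seq, dim_v)
    phi_q = F.elu(q) + 1                                     # positive feature map
    phi_k = F.elu(k) + 1
    kv = torch.einsum("bsd,bse->bde", phi_k, v)              # sum_j phi(k_j) v_j^T
    z = 1.0 / (torch.einsum("bsd,bd->bs", phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum("bsd,bde,bs->bse", phi_q, kv, z)

q = k = v = torch.randn(2, 1024, 64)
out = linear_attention(q, k, v)
```

The per-query cost here scales with the feature dimension squared rather than with sequence length, which is also why such variants may not pay off when the generated sequence is short.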


MemLong: Memory-Augmented Retrieval for Long Text Modeling

August 30, 2024

91% Match
Weijie Liu, Zecheng Tang, Juntao Li, ... , Min Zhang
Computation and Language
Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context languag...
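
The retrieval step the abstract alludes to, storing past context in an external memory and pulling back only the most relevant pieces instead of growing the key-value cache, might look roughly like this; the chunking, similarity measure, and top-k value are assumptions for illustration.

```python
# Sketch of chunk-level retrieval from an external memory of cached K/V pairs.
import torch

class RetrievalMemory:
    def __init__(self):
        self.chunk_keys = []      # one embedding per stored chunk
        self.chunk_kv = []        # cached (K, V) tensors per chunk

    def add(self, chunk_embedding, kv_cache):
        self.chunk_keys.append(chunk_embedding)
        self.chunk_kv.append(kv_cache)

    def retrieve(self, query_embedding, top_k=2):
        keys = torch.stack(self.chunk_keys)                   # (n_chunks, dim)
        scores = keys @ query_embedding                       # dot-product similarity
        idx = scores.topk(min(top_k, len(self.chunk_kv))).indices
        return [self.chunk_kv[i] for i in idx.tolist()]

memory = RetrievalMemory()
for _ in range(8):                                            # store 8 past chunks
    memory.add(torch.randn(128),
               (torch.randn(64, 16, 64), torch.randn(64, 16, 64)))
retrieved = memory.retrieve(torch.randn(128), top_k=2)        # attend over these only
```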


MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

November 20, 2024

91% Match
Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, ... , Yunhe Wang
Computation and Language

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models, such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspe...


MeMo: Towards Language Models with Associative Memory Mechanisms

February 18, 2025

91% Match
Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, ... , Raniero Romagnoli
Computation and Language
Artificial Intelligence

Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model...
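
To unpack the term "associative memory": a textbook correlation-matrix memory stores key-value pairs as a sum of outer products and reads them back with a matrix-vector product. The sketch below shows only that classical construction, not MeMo's layered design; the dimensions and orthonormal keys are assumptions chosen so recall is near-exact.

```python
# Classic correlation-matrix (Hebbian outer-product) associative memory.
import torch

class CorrelationMatrixMemory:
    def __init__(self, key_dim, value_dim):
        self.W = torch.zeros(value_dim, key_dim)

    def store(self, key, value):
        # Superimpose the outer product of value and key.
        self.W += torch.outer(value, key)

    def recall(self, key):
        return self.W @ key

# With (near-)orthonormal keys, stored values are recovered almost exactly.
mem = CorrelationMatrixMemory(key_dim=64, value_dim=32)
keys = torch.linalg.qr(torch.randn(64, 64)).Q[:8]     # 8 orthonormal keys
values = torch.randn(8, 32)
for k, v in zip(keys, values):
    mem.store(k, v)
print(torch.allclose(mem.recall(keys[0]), values[0], atol=1e-4))  # ~True
```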


Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling

September 15, 2022

91% Match
Qingyang Wu, Zhou Yu
Computation and Language

Transformer encoder-decoder models have achieved great performance in dialogue generation tasks; however, their inability to process long dialogue history often leads to truncation of the context. To address this problem, we propose a novel memory-augmented transformer that is compatible with existing pre-trained encoder-decoder models and enables efficient preservation of the dialogue history information. By incorporating a separate memory module alongside the pre-trained tra...


Addressing Some Limitations of Transformers with Feedback Memory

February 21, 2020

91% Match
Angela Fan, Thibaut Lavril, Edouard Grave, ... , Sainbayar Sukhbaatar
Machine Learning
Computation and Language
Machine Learning

Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower laye...
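
The truncated sentence points to the limitation that a layer at time t only sees lower-layer outputs from the same or earlier positions. One hedged reading of the title's "feedback memory" is sketched below: all layers' states at each past step are merged into a single memory vector that every layer of the current step attends to, at the cost of token-by-token processing. The merge weights and single-head attention layers are assumptions, not the paper's equations.

```python
# Sketch of a feedback-style memory: merged past states visible to all layers.
import torch
import torch.nn as nn

class FeedbackMemoryLM(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.mix = nn.Parameter(torch.zeros(n_layers + 1))   # softmax merge weights

    def forward(self, tokens):                    # tokens: (batch, seq, d_model)
        memory, outputs = [], []
        for t in range(tokens.size(1)):
            x = tokens[:, t:t + 1]                # process one position at a time
            states = [x]
            mem = torch.cat(memory + [x], dim=1)  # merged past states + current input
            for attn in self.layers:
                x, _ = attn(x, mem, mem)          # every layer sees the same memory
                states.append(x)
            w = torch.softmax(self.mix, dim=0)
            merged = sum(wi * s for wi, s in zip(w, states))
            memory.append(merged)                 # feed the merged state back
            outputs.append(x)
        return torch.cat(outputs, dim=1)

out = FeedbackMemoryLM()(torch.randn(2, 10, 128))
```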
