June 5, 2020
Transformer-based models have become ubiquitous in natural language processing thanks to their large capacity, innate parallelism and high performance. The contextualizing component of a Transformer block is the $\textit{pairwise dot-product}$ attention, which has a large $\Omega(L^2)$ memory requirement for length-$L$ sequences, limiting its ability to process long documents. This has been the subject of substantial interest recently, with multiple approximations proposed to reduce the quadratic memory requirement using sparse attention matrices. In this work, we propose to augment sparse Transformer blocks with a dense attention-based $\textit{global memory}$ of length $M$ ($\ll L$) which provides an aggregate global view of the entire input sequence to each position. Our augmentation has a manageable $O(M\cdot(L+M))$ memory overhead and can be seamlessly integrated with prior sparse solutions. Moreover, the global memory can also be used for sequence compression, by representing a long input sequence with the memory representations only. We empirically show that our method leads to substantial improvements on a range of tasks, including (a) synthetic tasks that require global reasoning, (b) masked language modeling, and (c) reading comprehension.
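A minimal NumPy sketch of the memory pattern described above, under my own simplifications (single attention head, no learned Q/K/V projections, and the sparse attention over the input itself elided); `global_memory_attention` is an illustrative name, not the paper's API:

```python
import numpy as np
from scipy.special import softmax

def global_memory_attention(X, Mem, d):
    """Memory attends densely over the joint sequence [Mem; X];
    each input position in turn reads the memory. The sparse
    attention among input positions themselves is elided."""
    joint = np.concatenate([Mem, X], axis=0)                        # (M+L, d)
    mem_new = softmax(Mem @ joint.T / np.sqrt(d), axis=-1) @ joint  # (M, d)
    x_new = softmax(X @ Mem.T / np.sqrt(d), axis=-1) @ Mem          # (L, d)
    return x_new, mem_new

L, M, d = 512, 16, 64
rng = np.random.default_rng(0)
X, Mem = rng.standard_normal((L, d)), rng.standard_normal((M, d))
x_new, mem_new = global_memory_attention(X, Mem, d)
```

The two score matrices have shapes $(M, M+L)$ and $(L, M)$, which is where the $O(M\cdot(L+M))$ overhead quoted in the abstract comes from.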
Similar papers
June 20, 2020
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selective...
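As a hedged illustration of the trainable-memory idea (a sketch under my own assumptions, not this paper's exact model), memory rows can simply be prepended to the token representations and processed by ordinary self-attention:

```python
import numpy as np
from scipy.special import softmax

def attend_with_memory(tokens, memory, d):
    # Single-head self-attention over [memory; tokens]: memory rows and
    # token rows exchange information through one attention pattern.
    h = np.concatenate([memory, tokens], axis=0)      # (m+L, d)
    out = softmax(h @ h.T / np.sqrt(d), axis=-1) @ h  # (m+L, d)
    m = len(memory)
    return out[:m], out[m:]  # updated memory rows, updated token rows
```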
December 3, 2022
Transformer variants dominate the state of the art in different natural language processing tasks such as translation, reading comprehension and summarization. Our paper focuses on general memory slots added to the inputs and studies the effects of adding these slots. This paper is a follow-on study of the role of the general memory slots that were added to the input of the proposed model in previous work. We have two main tasks: 1) a pretraining task using masked language modeli...
April 19, 2023
This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and glob...
June 7, 2024
Leveraging external knowledge is crucial for achieving high performance in knowledge-intensive tasks, such as question answering. The retrieve-and-read approach is widely adopted for integrating external knowledge into a language model. However, this approach suffers from increased computational cost and latency due to the long context length, which grows proportionally with the amount of retrieved knowledge. Furthermore, existing retrieval-augmented models typically retrieve...
October 13, 2022
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. However, they suffer from a prohibitive limitation due to the self-attention mechanism, which induces $O(n^2)$ complexity with respect to sequence length. To address this limitation we introduce the LSG architecture, which relies on Local, Sparse and Global attention. We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents. Interesting...
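A toy sketch of how an LSG-style attention mask might combine the three patterns (the window, stride and global-token counts here are arbitrary choices of mine, not the paper's configuration):

```python
import numpy as np

def lsg_mask(L, window=4, stride=8, g=2):
    """Boolean (L, L) mask: True where attention is allowed."""
    idx = np.arange(L)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window  # Local band
    mask |= (idx[None, :] % stride) == 0                  # Sparse strided columns
    mask[:g, :] = True                                    # Global tokens see everything
    mask[:, :g] = True                                    # ...and are seen by everyone
    return mask
```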
July 14, 2022
Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level...
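The segment-level recurrence itself can be sketched in a few lines (a toy rendering; `encode_segment` stands in for a full Transformer encoder and is hypothetical):

```python
def process_long_sequence(tokens, memory, segment_len, encode_segment):
    # The memory produced while encoding one segment is passed into the
    # next, so information propagates across the whole sequence while
    # each step only pays for one segment.
    outputs = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start : start + segment_len]
        seg_out, memory = encode_segment(segment, memory)
        outputs.extend(seg_out)
    return outputs, memory

# Toy stand-in encoder: echoes tokens and accumulates their sum in memory.
outs, mem = process_long_sequence(list(range(10)), 0, 4,
                                  lambda seg, mem: (seg, mem + sum(seg)))
```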
October 14, 2020
Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues, as they need to store the token-level representations of the entire history as memory. We present Memformer, an efficient neural network for sequence modeling that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new opt...
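A loose sketch of the external-memory idea (my own stand-in update rules, not Memformer's actual ones): a fixed number of slots is read by cross-attention and then rewritten from the current segment, so the memory footprint stays constant however long the history grows.

```python
import numpy as np
from scipy.special import softmax

def read_memory(queries, slots, d):
    # Each query cross-attends over the fixed set of memory slots.
    return softmax(queries @ slots.T / np.sqrt(d), axis=-1) @ slots

def write_memory(slots, segment, d):
    # Each slot is rewritten from the current segment; the slot count
    # never grows, hence the constant memory footprint.
    return softmax(slots @ segment.T / np.sqrt(d), axis=-1) @ segment
```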
November 29, 2023
Transformers have become the gold standard for many natural language processing tasks and, in particular, for multi-hop question answering (MHQA). This task involves processing a long document and reasoning over multiple parts of it. The landscape of MHQA approaches can be classified into two primary categories. The first group focuses on extracting supporting evidence, thereby constraining the QA model's context to predicted facts. Conversely, the second group relies on ...
December 15, 2022
Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not fl...
May 9, 2024
Transformer-based large language models (LLMs) have been widely used in language processing applications. However, most of them restrict the context window within which the model can attend to every token in the input. Previous work on recurrent models can memorize past tokens to enable unlimited context and maintain effectiveness. However, they have "flat" memory architectures, which have limitations in selecting and filtering information. Since humans are good at learning a...