Recurrent Memory Transformer

R-Transformer: Recurrent Neural Network Enhanced Transformer

July 12, 2019

89% Match

Zhiwei Wang, Yao Ma, ..., Jiliang Tang

Machine Learning

Computation and Language

Computer Vision and Pattern ...

Audio and Speech Processing

Recurrent Neural Networks have long been the dominant choice for sequence modeling. However, they suffer severely from two issues: they are weak at capturing very long-term dependencies and cannot parallelize the sequential computation procedure. Therefore, many non-recurrent sequence models that are built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as Transformer have demonstrated extreme effectiveness in...

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

January 9, 2019

89% Match

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, ..., Ruslan Salakhutdinov

Machine Learning

Computation and Language

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fra...

Titans: Learning to Memorize at Test Time

December 31, 2024

89% Match

Ali Behrouz, Peilin Zhong, Vahab Mirrokni

Machine Learning

Artificial Intelligence

Computation and Language

Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new ne...

Blockwise Parallel Transformer for Large Context Models

May 30, 2023

88% Match

Hao Liu, Pieter Abbeel

Computation and Language

Machine Learning

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise P...

TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling

April 4, 2021

88% Match

Tze Yuang Chong, Xuyang Wang, ..., Junjie Wang

Computation and Language

In this paper, we describe the use of recurrent neural networks to capture sequential information from the self-attention representations to improve Transformers. Although the self-attention mechanism provides a means to exploit long context, the sequential information, i.e. the arrangement of tokens, is not explicitly captured. We propose to cascade recurrent neural networks onto the Transformer, which we refer to as the TransfoRNN model, to capture the sequential informa...

Do Transformers Need Deep Long-Range Memory?

July 7, 2020

88% Match

Jack W. Rae, Ali Razavi

Machine Learning

Computation and Language

Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which renders its state to be thousands of times larger than RNN predecessors. However it is...

Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

August 17, 2020

88% Match

Davis Yoshida, Allyson Ettinger, Kevin Gimpel

Computation and Language

Fine-tuning a pretrained transformer for a downstream task has become a standard method in NLP in the last few years. While the results from these models are impressive, applying them can be extremely computationally expensive, as is pretraining new models with the latest architectures. We present a novel method for applying pretrained transformer language models which lowers their memory requirement both at training and inference time. An additional benefit is that our metho...

Memory Layers at Scale

December 12, 2024

88% Match

Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, ..., Gargi Ghosh

Computation and Language

Artificial Intelligence

Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform de...

MEMORYLLM: Towards Self-Updatable Large Language Models

February 7, 2024

88% Match

Yu Wang, Xiusi Chen, ..., Julian McAuley

Computation and Language

Existing Large Language Models (LLMs) usually remain static after deployment, which might make it hard to inject new knowledge into the model. We aim to build models containing a considerable portion of self-updatable parameters, enabling the model to integrate new knowledge effectively and efficiently. To this end, we introduce MEMORYLLM, a model that comprises a transformer and a fixed-size memory pool within the latent space of the transformer. MEMORYLLM can self-update wi...

Global memory transformer for processing long documents

December 3, 2022

88% Match

Arij Al Adel

Computation and Language

Machine Learning

Transformer variants dominate the state-of-the-art in different natural language processing tasks such as translation, reading comprehension and summarization. Our paper focuses on using general memory slots added to the inputs and studying the results of adding these slots. This paper is a follow-on study of the general memory slots rule that were added to the input of the proposed model in previous work. We have two main tasks: 1) pretraining task using masked language modeli...
