Extended Mind Transformers

June 4, 2024

Modifying Memories in Transformer Models

December 1, 2020

90% Match

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, ... , Sanjiv Kumar

Computation and Language

Machine Learning

Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of parameters. While the tasks of improving the memorization and generalization of Transformers have been widely studied, it is not well known how to make transformers forget specific old facts and memorize new ones. In this paper, we propose ...

Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

February 3, 2024

90% Match

Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, ... , Armaghan Eshaghi

Computation and Language

Machine Learning

Recently, large language models (LLMs) have shown remarkable capabilities including understanding context, engaging in logical reasoning, and generating responses. However, this is achieved at the expense of stringent computational and memory requirements, hindering their ability to effectively support long input sequences. This survey provides an inclusive review of the recent techniques and methods devised to extend the sequence length in LLMs, thereby enhancing their capac...

Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing

January 10, 2024

90% Match

Zi Yang, Nan Hua

Computation and Language

As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend. However, this approach requires a large memory and/or takes into consideration the specific LM architecture. Moreover, due to the causal n...

Do Transformers Need Deep Long-Range Memory?

July 7, 2020

90% Match

Jack W. Rae, Ali Razavi

Machine Learning

Computation and Language

Machine Learning

Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which renders its state to be thousands of times larger than RNN predecessors. However it is...

$\infty$-former: Infinite Memory Transformer

September 1, 2021

90% Match

Pedro Henrique Martins, Zita Marinho, André F. T. Martins

Computation and Language

Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length. While variations of efficient transformers have been proposed, they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend ov...

Retrieval Head Mechanistically Explains Long-Context Factuality

April 24, 2024

89% Match

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, ... , Yao Fu

Computation and Language

Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing propertie...

Recurrent Memory Transformer

July 14, 2022

89% Match

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Computation and Language

Machine Learning

Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention combines information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level...

Current Limitations of Language Models: What You Need is Retrieval

September 15, 2020

89% Match

Aran Komatsuzaki

Computation and Language

Machine Learning

We classify and re-examine some of the current approaches to improve the performance-compute trade-off of language models, including (1) non-causal models (such as masked language models), (2) extension of batch length with efficient attention, (3) recurrence, (4) conditional computation and (5) retrieval. We identify some limitations that (1)-(4) suffer from. For example, (1) currently struggles with open-ended text generation with the output loosely constrained by the input a...

The NLP Task Effectiveness of Long-Range Transformers

February 16, 2022

89% Match

Guanghui Qin, Yukun Feng, Benjamin Van Durme

Computation and Language

Machine Learning

Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have theoretically greater efficiency, their effectiveness on real NLP tasks has not been well studied. We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the effect of pretra...

Memory Transformer

June 20, 2020

89% Match

Mikhail S. Burtsev, Yuri Kuratov, ... , Grigory V. Sapunov

Computation and Language

Machine Learning

Neural and Evolutionary Comp...

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows a transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selective...
