ID: 2407.04841

Associative Recurrent Memory Transformer

July 5, 2024

Similar papers 2

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models

May 23, 2024

89% Match
Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, ... , Yu Su
Computation and Language
Artificial Intelligence

In order to thrive in hostile and ever-changing natural environments, mammalian brains evolved to store large amounts of knowledge about the world and continually integrate new information while avoiding catastrophic forgetting. Despite their impressive accomplishments, large language models (LLMs), even with retrieval-augmented generation (RAG), still struggle to efficiently and effectively integrate a large amount of new experiences after pre-training. In this work, we introd...

Extended Mind Transformers

June 4, 2024

89% Match
Phoebe Klett, Thomas Ahle
Machine Learning
Computation and Language

Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should ...
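
As a rough sketch of the Memorizing-Transformers-style mechanism the abstract builds on, the snippet below retrieves the top-k entries of an external (key, value) memory bank by dot-product similarity and lets a query attend over them together with its local context. Names, shapes, and the exact retrieval rule are illustrative assumptions; neither paper's actual method (including its handling of positional encodings) is reproduced here.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend_with_memory(q, local_k, local_v, mem_k, mem_v, top_k=4):
    # q: (d,); local_k/local_v: (L, d); mem_k/mem_v: (M, d).
    # Exact dot-product top-k retrieval stands in for the approximate
    # nearest-neighbour index a real memory bank would typically use.
    idx = np.argsort(mem_k @ q)[-top_k:]          # most similar stored keys
    keys = np.concatenate([local_k, mem_k[idx]])  # (L + top_k, d)
    vals = np.concatenate([local_v, mem_v[idx]])
    weights = softmax(keys @ q / np.sqrt(q.shape[0]))
    return weights @ vals                         # (d,)

rng = np.random.default_rng(0)
d, L, M = 8, 5, 100
out = attend_with_memory(rng.normal(size=d),
                         rng.normal(size=(L, d)), rng.normal(size=(L, d)),
                         rng.normal(size=(M, d)), rng.normal(size=(M, d)))
print(out.shape)  # (8,)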

On the Power of Convolution Augmented Transformer

July 8, 2024

89% Match
Mingchen Li, Xuechen Zhang, ... , Samet Oymak
Machine Learning
Computation and Language
Neural and Evolutionary Computing

The transformer architecture has catalyzed revolutionary advances in language modeling. However, recent architectural recipes, such as state-space models, have bridged the performance gap. Motivated by this, we examine the benefits of Convolution-Augmented Transformer (CAT) for recall, copying, and length generalization tasks. CAT incorporates convolutional filters in the K/Q/V embeddings of an attention layer. Through CAT, we show that the locality of the convolution synergi...
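
One plausible reading of "convolutional filters in the K/Q/V embeddings" is a depthwise, causal 1-D convolution over the token sequence applied before the usual Q/K/V projections. The NumPy sketch below shows that idea; the filter placement, names, and shapes are assumptions for illustration, not the paper's exact architecture.

import numpy as np

def causal_conv1d(x, w):
    # Depthwise causal 1-D convolution along the sequence axis.
    # x: (N, d) token embeddings; w: (k, d) per-channel filter taps.
    # Position t only mixes positions t-k+1 .. t (zero-padded on the left).
    k, d = w.shape
    x_pad = np.concatenate([np.zeros((k - 1, d)), x], axis=0)
    return np.stack([(x_pad[t:t + k] * w).sum(axis=0) for t in range(x.shape[0])])

def conv_augmented_qkv(x, Wq, Wk, Wv, conv_w):
    # Locally mix the embeddings with the convolution, then project to Q/K/V.
    x_local = causal_conv1d(x, conv_w)
    return x_local @ Wq, x_local @ Wk, x_local @ Wv

rng = np.random.default_rng(0)
N, d, k = 6, 4, 3
x = rng.normal(size=(N, d))
Wq, Wk, Wv = rng.normal(size=(3, d, d))
Q, K, V = conv_augmented_qkv(x, Wq, Wk, Wv, rng.normal(size=(k, d)))
print(Q.shape, K.shape, V.shape)  # (6, 4) each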

Retrieval Head Mechanistically Explains Long-Context Factuality

April 24, 2024

89% Match
Wenhao Wu, Yizhong Wang, Guangxuan Xiao, ... , Yao Fu
Computation and Language

Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information; we dub these retrieval heads. We identify intriguing propertie...

MATTER: Memory-Augmented Transformer Using Heterogeneous Knowledge Sources

June 7, 2024

89% Match
Dongkyu Lee, Chandana Satya Prakash, ... , Jens Lehmann
Computation and Language
Artificial Intelligence

Leveraging external knowledge is crucial for achieving high performance in knowledge-intensive tasks, such as question answering. The retrieve-and-read approach is widely adopted for integrating external knowledge into a language model. However, this approach suffers from increased computational cost and latency due to the long context length, which grows proportionally with the amount of retrieved knowledge. Furthermore, existing retrieval-augmented models typically retrieve...

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

June 29, 2020

89% Match
Angelos Katharopoulos, Apoorv Vyas, ... , François Fleuret
Machine Learning

Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length...
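
As a rough illustration of the reduction described above, the sketch below implements the non-causal form of linear attention in NumPy, using the feature map phi(x) = elu(x) + 1 mentioned in the paper; the shapes, function names, and the tiny random example are illustrative assumptions rather than the authors' implementation, and the autoregressive case would replace the sums over keys with prefix sums.

import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map (elu(x) = x for x > 0, exp(x) - 1 otherwise)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # O(N) attention: group the products as phi(Q) @ (phi(K)^T V) instead of
    # forming the N x N matrix softmax(Q K^T). Q, K: (N, d); V: (N, d_v).
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    KV = Kf.T @ V                      # (d, d_v) summary of all keys/values
    Z = Qf @ Kf.sum(axis=0) + eps      # (N,) per-query normalisers
    return (Qf @ KV) / Z[:, None]      # (N, d_v)

# Tiny usage example with random inputs (illustrative only).
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
print(linear_attention(Q, K, V).shape)  # (8, 4)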

Human-like Episodic Memory for Infinite Context LLMs

July 12, 2024

89% Match
Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, ... , Jun Wang
Artificial Intelligence
Computation and Language
Machine Learning
Neurons and Cognition

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs, enabling ...

Anchor-based Large Language Models

February 12, 2024

89% Match
Jianhui Pang, Fanghua Ye, ... , Longyue Wang
Computation and Language
Artificial Intelligence

Large language models (LLMs) predominantly employ decoder-only transformer architectures, necessitating the retention of keys/values information for historical tokens to provide contextual information and avoid redundant computation. However, the substantial size and parameter volume of these LLMs require massive GPU memory. This memory demand increases with the length of the input text, leading to an urgent need for more efficient methods of information storage and processin...
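
To make the memory pressure described above concrete, the snippet below estimates the size of the keys/values cached by a decoder-only model as the input grows; the 32-layer, 32-head, 128-dimensional configuration is a hypothetical 7B-scale setting and bytes_per_elem=2 assumes fp16/bf16 storage.

def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    # Keys and values each store n_layers * n_heads * head_dim numbers per
    # token, hence the leading factor of 2.
    return 2 * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB of KV cache per sequence")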

Document-level Neural Machine Translation with Associated Memory Network

October 31, 2019

89% Match
Shu Jiang, Rui Wang, Zuchao Li, Masao Utiyama, Kehai Chen, Eiichiro Sumita, ... , Bao-liang Lu
Computation and Language

Standard neural machine translation (NMT) rests on the assumption that document-level context is independent. Most existing document-level NMT approaches settle for a cursory sense of global document-level information, while this work focuses on exploiting detailed document-level context by means of a memory network. The memory network's capacity to detect the part of memory most relevant to the current sentence offers a natural solution to model the ...

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

May 22, 2020

89% Match
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, ... , Douwe Kiela
Computation and Language
Machine Learning

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems...
