Recurrent Memory Transformer

MELODI: Exploring Memory Compression for Long Contexts

October 4, 2024

88% Match

Yinpeng Chen, DeLesley Hutchins, Aren Jansen, Andrey Zhmoginov, ... , Jesper Andersen

Machine Learning

Artificial Intelligence

We present MELODI, a novel memory architecture designed to efficiently process long documents using short context windows. The key principle behind MELODI is to represent short-term and long-term memory as a hierarchical compression scheme across both network layers and context windows. Specifically, the short-term memory is achieved through recurrent compression of context windows across multiple layers, ensuring smooth transitions between windows. In contrast, the long-term...

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

December 27, 2019

88% Match

Thomas Dowdell, Hongyu Zhang

Machine Learning

Computation and Language

The key to a Transformer model is the self-attention mechanism, which allows the model to analyze an entire sequence in a computationally efficient manner. Recent work has suggested the possibility that general attention mechanisms used by RNNs could be replaced by active-memory mechanisms. In this work, we evaluate whether various active-memory mechanisms could replace self-attention in a Transformer. Our experiments suggest that active-memory alone achieves comparable resul...

Not All Memories are Created Equal: Learning to Forget by Expiring

May 13, 2021

88% Match

Sainbayar Sukhbaatar, Da Ju, Spencer Poff, Stephen Roller, Arthur Szlam, ... , Angela Fan

Machine Learning

Artificial Intelligence

Attention mechanisms have shown promising results in sequence modeling tasks that require long-term memory. Recent work investigated mechanisms to reduce the computational cost of preserving and storing memories. However, not all content in the past is equally important to remember. We propose Expire-Span, a method that learns to retain the most important information and expire the irrelevant information. This forgetting of memories enables Transformers to scale to attend ove...

MoM: Linear Sequence Modeling with Mixture-of-Memories

February 19, 2025

88% Match

Jusen Du, Weigao Sun, Disen Lan, ... , Yu Cheng

Computation and Language

Artificial Intelligence

Machine Learning

Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term m...

TransformerFAM: Feedback attention is working memory

April 14, 2024

88% Match

Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, ... , Pedro Moreno Mengibar

Machine Learning

Artificial Intelligence

Computation and Language

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no ad...

Contextual Memory Reweaving in Large Language Models Using Layered Latent State Reconstruction

February 4, 2025

88% Match

Frederick Dillon, Gregor Halvorsen, Simon Tattershall, ... , Gareth Vanderpool

Computation and Language

Memory retention challenges in deep neural architectures have ongoing limitations in the ability to process and recall extended contextual information. Token dependencies degrade as sequence length increases, leading to a decline in coherence and factual consistency across longer outputs. A structured approach is introduced to mitigate this issue through the reweaving of latent states captured at different processing layers, reinforcing token representations over extended seq...

Learning Memory Mechanisms for Decision Making through Demonstrations

November 12, 2024

88% Match

William Yue, Bo Liu, Peter Stone

Machine Learning

Robotics

In Partially Observable Markov Decision Processes, integrating an agent's history into memory poses a significant challenge for decision-making. Traditional imitation learning, relying on observation-action pairs for expert demonstrations, fails to capture the expert's memory mechanisms used in decision-making. To capture memory processes as demonstrations, we introduce the concept of memory dependency pairs $(p, q)$ indicating that events at time $p$ are recalled for decisio...

MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models

February 23, 2024

88% Match

Nathanaël Carraz Rakotonirina, Marco Baroni

Computation and Language

Artificial Intelligence

Machine Learning

Transformer-based language models (LMs) track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe a LM's ability to keep track of multiple fact updates, a MemoryPrompt-a...

Autonomous Structural Memory Manipulation for Large Language Models Using Hierarchical Embedding Augmentation

January 23, 2025

88% Match

Derek Yotheringhay, Alistair Kirkland, ... , Josiah Whitesteeple

Computation and Language

Artificial Intelligence

Transformative innovations in model architectures have introduced hierarchical embedding augmentation as a means to redefine the representation of tokens through multi-level semantic structures, offering enhanced adaptability to complex linguistic inputs. Autonomous structural memory manipulation further advances this paradigm through dynamic memory reallocation mechanisms that prioritize critical contextual features while suppressing less relevant information, enabling scala...

Space Time Recurrent Memory Network

September 14, 2021

88% Match

Hung Nguyen, Chanho Kim, Fuxin Li

Computer Vision and Pattern ...

Artificial Intelligence

Transformers have recently been popular for learning and inference in the spatial-temporal domain. However, their performance relies on storing and applying attention to the feature tensor of each frame in video. Hence, their space and time complexity increase linearly as the length of video grows, which could be very costly for long videos. We propose a novel visual memory network architecture for the learning and inference problem in the spatial-temporal domain. We maintain...
