ID: 2405.06067

HMT: Hierarchical Memory Transformer for Long Context Language Processing

May 9, 2024


Similar papers 2

Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

April 16, 2024

90% Match
Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, ..., Jinwoo Shin
Machine Learning
Artificial Intelligence

Large language models (LLMs) have shown remarkable performance in various natural language processing tasks. However, a primary constraint they face is the context limit, i.e., the maximum number of tokens they can process. Previous works have explored architectural changes and modifications in positional encoding to relax the constraint, but they often require expensive training or do not address the computational demands of self-attention. In this paper, we present Hierarch...
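The abstract above is cut off before the method, so the following is only a minimal sketch of the general idea the title suggests: encode fixed-size chunks and merge adjacent chunk representations level by level until one compact representation remains. The encoder, the merge rule, and all names below are illustrative assumptions, not the paper's algorithm.

```python
# Minimal sketch of hierarchical chunk merging (illustrative only).
import torch

EMBED = torch.nn.Embedding(5000, 64)  # toy token embedding table

def encode_chunk(tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder encoder: mean-pool token embeddings into one chunk vector.
    return EMBED(tokens).mean(dim=0)

def merge(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Placeholder merge of two adjacent chunk vectors.
    return (a + b) / 2

def hierarchical_merge(chunks: list[torch.Tensor]) -> torch.Tensor:
    # Pairwise-merge chunk representations until a single one remains.
    reps = [encode_chunk(c) for c in chunks]
    while len(reps) > 1:
        nxt = [merge(reps[i], reps[i + 1]) for i in range(0, len(reps) - 1, 2)]
        if len(reps) % 2 == 1:          # carry an odd leftover chunk upward
            nxt.append(reps[-1])
        reps = nxt
    return reps[0]

long_input = torch.arange(4096).reshape(16, 256)   # 16 chunks of 256 tokens
print(hierarchical_merge(list(long_input)).shape)  # torch.Size([64])
```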


Associative Recurrent Memory Transformer

July 5, 2024

90% Match
Ivan Rodkin, Yuri Kuratov, ..., Mikhail Burtsev
Computation and Language
Artificial Intelligence
Machine Learning

This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tas...
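A minimal sketch of the general pattern the abstract describes: transformer self-attention over each local segment, with a small set of memory tokens carried recurrently from one segment to the next. The module and parameter names are assumptions, and the plain memory carry below is not ARMT's associative update.

```python
# Sketch of segment-level recurrence with memory tokens (general pattern,
# not ARMT's associative memory). All names here are hypothetical.
import torch
import torch.nn as nn

class RecurrentMemorySegmentModel(nn.Module):
    def __init__(self, d_model=64, n_mem=4, seg_len=32):
        super().__init__()
        self.mem_init = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_mem, self.seg_len = n_mem, seg_len

    def forward(self, x):                        # x: (batch, seq, d_model)
        b = x.size(0)
        memory = self.mem_init.expand(b, -1, -1)
        outputs = []
        for start in range(0, x.size(1), self.seg_len):
            seg = x[:, start:start + self.seg_len]
            # Local self-attention over [memory tokens ; segment tokens].
            h = self.encoder(torch.cat([memory, seg], dim=1))
            memory = h[:, :self.n_mem]           # recurrent carry to next segment
            outputs.append(h[:, self.n_mem:])
        return torch.cat(outputs, dim=1), memory

model = RecurrentMemorySegmentModel()
out, mem = model(torch.randn(2, 128, 64))
print(out.shape, mem.shape)                      # (2, 128, 64) (2, 4, 64)
```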


MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory

April 17, 2024

90% Match
Ali Modarressi, Abdullatif Köksal, Ayyoob Imani, ..., Hinrich Schütze
Computation and Language

While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Parametric memory pools and model editing are only partial solutions. Retrieval Augment...
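As a rough illustration of what an explicit read-write memory can look like, the sketch below implements a tiny triple store with write and read calls that a finetuned model could emit; the API names are hypothetical stand-ins, not MemLLM's interface.

```python
# Hypothetical explicit read-write memory an LLM could be trained to call.
from dataclasses import dataclass, field

@dataclass
class TripleMemory:
    store: set[tuple[str, str, str]] = field(default_factory=set)

    def write(self, subj: str, rel: str, obj: str) -> None:
        # Persist a fact outside the model's parameters.
        self.store.add((subj, rel, obj))

    def read(self, subj: str | None = None, rel: str | None = None):
        # Retrieve facts matching a partially specified query.
        return [t for t in self.store
                if (subj is None or t[0] == subj)
                and (rel is None or t[1] == rel)]

memory = TripleMemory()
memory.write("HMT", "proposed_in", "2024")
print(memory.read(subj="HMT"))   # [('HMT', 'proposed_in', '2024')]
```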


QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism

June 19, 2024

90% Match
Bo Wang, Heyan Huang, Yixin Cao, Jiahao Ying, ..., Chong Feng
Computation and Language

While large language models (LLMs) have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanism offers a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques face challenges with static knowledge integration, leading to insufficient adap...
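The abstract mentions compression, summarization, and structuring as memory techniques; below is a generic sketch of that class of mechanism (summarize chunks into a memory, then pull only question-relevant entries), not QRMeM's actual question-then-reflection procedure. All helpers are illustrative stand-ins.

```python
# Generic chunk-summary memory with question-driven retrieval (illustrative).
from collections import Counter

def summarize(chunk: str, max_words: int = 8) -> str:
    # Stand-in compressor: keep only the first few words of the chunk.
    return " ".join(chunk.split()[:max_words])

def overlap(a: str, b: str) -> int:
    # Crude lexical relevance score between question and memory entry.
    return sum((Counter(a.lower().split()) & Counter(b.lower().split())).values())

def build_memory(document: str, chunk_words: int = 50) -> list[str]:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    return [summarize(c) for c in chunks]

def answer_context(question: str, memory: list[str], top_k: int = 2) -> list[str]:
    # Pull only the most question-relevant memory entries into the prompt.
    return sorted(memory, key=lambda m: overlap(question, m), reverse=True)[:top_k]

doc = "memory mechanisms compress long context into compact summaries " * 40
print(answer_context("how is long context compressed?", build_memory(doc)))
```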


Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

January 9, 2019

90% Match
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, ..., Ruslan Salakhutdinov
Machine Learning
Computation and Language
Machine Learning

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fra...
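A minimal sketch of the segment-level recurrence the abstract describes: hidden states of the previous segment are cached with gradients stopped and reused as extra keys and values for the current segment. The relative positional encoding scheme is omitted, and the single attention layer below is only illustrative.

```python
# Sketch of Transformer-XL-style segment-level recurrence (single layer,
# relative positional encoding omitted for brevity).
import torch
import torch.nn as nn

d_model, n_head, seg_len = 64, 4, 32
attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

def process_segment(seg, cache):
    # seg: (batch, seg_len, d_model); cache: previous segment's states or None
    kv = seg if cache is None else torch.cat([cache, seg], dim=1)
    out, _ = attn(query=seg, key=kv, value=kv)   # queries cover only the current segment
    return out, out.detach()                     # new cache carries no gradient

x = torch.randn(2, 4 * seg_len, d_model)
cache, outputs = None, []
for s in range(0, x.size(1), seg_len):
    out, cache = process_segment(x[:, s:s + seg_len], cache)
    outputs.append(out)
print(torch.cat(outputs, dim=1).shape)           # (2, 128, 64)
```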


Language Model Pre-training for Hierarchical Document Representations

January 26, 2019

90% Match
Ming-Wei Chang, Kristina Toutanova, ..., Jacob Devlin
Computation and Language

Hierarchical neural architectures are often used to capture long-distance dependencies and have been applied to many document-level tasks such as summarization, document segmentation, and sentiment analysis. However, effective usage of such a large context can be difficult to learn, especially in the case where there is limited labeled data available. Building on the recent success of language model pretraining methods for learning flat representations of text, we propose alg...
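As a rough sketch of the hierarchical pattern the abstract refers to (not the paper's pretraining algorithm), the model below contextualizes tokens within each sentence, pools each sentence to a vector, and runs a second encoder over the sentence vectors; all dimensions and modules are assumptions.

```python
# Two-level hierarchical document encoder (illustrative only).
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    def __init__(self, vocab=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers=1)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, num_layers=1)

    def forward(self, doc):                           # doc: (n_sents, sent_len) token ids
        tokens = self.sent_encoder(self.embed(doc))   # per-sentence contextualization
        sent_vecs = tokens.mean(dim=1)                # pool each sentence to one vector
        return self.doc_encoder(sent_vecs.unsqueeze(0))   # (1, n_sents, d_model)

doc = torch.randint(0, 1000, (12, 20))                # 12 sentences of 20 tokens
print(HierarchicalDocEncoder()(doc).shape)            # torch.Size([1, 12, 64])
```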


MATTER: Memory-Augmented Transformer Using Heterogeneous Knowledge Sources

June 7, 2024

90% Match
Dongkyu Lee, Chandana Satya Prakash, ..., Jens Lehmann
Computation and Language
Artificial Intelligence

Leveraging external knowledge is crucial for achieving high performance in knowledge-intensive tasks, such as question answering. The retrieve-and-read approach is widely adopted for integrating external knowledge into a language model. However, this approach suffers from increased computational cost and latency due to the long context length, which grows proportionally with the number of retrieved knowledge sources. Furthermore, existing retrieval-augmented models typically retrieve...
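The cost argument in the abstract is easy to make concrete: in retrieve-and-read, retrieved passages are concatenated into the prompt, so self-attention work grows roughly quadratically with the number of passages. The numbers below are illustrative assumptions only.

```python
# Back-of-the-envelope cost of retrieve-and-read prompts (illustrative).
def attention_cost(question_len: int, passage_len: int, k_retrieved: int) -> int:
    # Prompt length grows linearly with k; pairwise attention work quadratically.
    total_len = question_len + k_retrieved * passage_len
    return total_len ** 2

for k in (1, 5, 10):
    print(k, attention_cost(question_len=32, passage_len=128, k_retrieved=k))
# 1 25600, 5 451584, 10 1721344
```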


Memorizing Documents with Guidance in Large Language Models

June 23, 2024

90% Match
Bumjin Park, Jaesik Choi
Computation and Language
Artificial Intelligence

Training data plays a pivotal role in AI models. Large language models (LLMs) are trained with massive amounts of documents, and their parameters hold document-related contents. Recently, several studies identified content-specific locations in LLMs by examining the parameters. Instead of the post hoc interpretation, we propose another approach. We propose document-wise memory architecture to track document memories in training. The proposed architecture maps document represe...


Hierarchical Learning for Generation with Long Source Sequences

April 15, 2021

90% Match
Tobias Rohde, Xiaoxia Wu, Yinhan Liu
Computation and Language

One of the challenges for current sequence to sequence (seq2seq) models is processing long sequences, such as those in summarization and document level machine translation tasks. These tasks require the model to reason at the token level as well as the sentence and paragraph level. We design and study a new Hierarchical Attention Transformer-based architecture (HAT) that outperforms standard Transformers on several sequence to sequence tasks. Furthermore, our model achieves s...


GMAT: Global Memory Augmentation for Transformers

June 5, 2020

90% Match
Ankit Gupta, Jonathan Berant
Machine Learning
Computation and Language
Machine Learning

Transformer-based models have become ubiquitous in natural language processing thanks to their large capacity, innate parallelism and high performance. The contextualizing component of a Transformer block is the $\textit{pairwise dot-product}$ attention that has a large $\Omega(L^2)$ memory requirement for length $L$ sequences, limiting its ability to process long documents. This has been the subject of substantial interest recently, where multiple approximations were propose...
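A minimal sketch of the global-memory idea (not necessarily GMAT's exact scheme): a handful of memory tokens attend to, and are attended by, every position, while ordinary tokens only see a local window plus the memory, which keeps the number of attended pairs well below the full $L^2$ budget.

```python
# Attention mask with a small set of global memory tokens (illustrative).
import torch

def global_memory_mask(seq_len: int, n_mem: int, window: int) -> torch.Tensor:
    total = n_mem + seq_len
    allowed = torch.zeros(total, total, dtype=torch.bool)
    allowed[:n_mem, :] = True                 # memory tokens attend everywhere
    allowed[:, :n_mem] = True                 # every token attends to the memory
    for i in range(seq_len):                  # ordinary tokens: local window only
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        allowed[n_mem + i, n_mem + lo:n_mem + hi] = True
    return allowed

mask = global_memory_mask(seq_len=512, n_mem=8, window=16)
print(mask.sum().item(), "allowed pairs vs", (8 + 512) ** 2, "for full attention")
```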
