Adding Recurrence to Pretrained Transfor...

RecycleGPT: An Autoregressive Language Model with Recyclable Module

August 7, 2023

92% Match

Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu, Kunpeng Wang, ... , Guangwen Yang

Computation and Language

Artificial Intelligence

Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model with fast decoding speed by recycling pre-generated model states without running the whole model in multiple steps. Our approach relies on the observation that adjacent tokens in a sequence usually have strong correlations and the next token in a sequence can be reasonably guessed or inferred based on the preceding ones. Expe...

Language Models with Transformers

April 20, 2019

91% Match

Chenguang Wang, Mu Li, Alexander J. Smola

Computation and Language

Artificial Intelligence

Machine Learning

The Transformer architecture is superior to RNN-based models in computational efficiency. Recently, GPT and BERT demonstrate the efficacy of Transformer models on various NLP tasks using pre-trained language models on large-scale corpora. Surprisingly, these Transformer architectures are suboptimal for language model itself. Neither self-attention nor the positional encoding in the Transformer is able to efficiently incorporate the word-level sequential context crucial to lan...

Scaling Transformer to 1M tokens and beyond with RMT

April 19, 2023

91% Match

Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Computation and Language

Artificial Intelligence

Machine Learning

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and glob...

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

January 9, 2019

91% Match

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, ... , Ruslan Salakhutdinov

Machine Learning

Computation and Language

Machine Learning

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fra...

Blockwise Parallel Transformer for Large Context Models

May 30, 2023

91% Match

Hao Liu, Pieter Abbeel

Computation and Language

Machine Learning

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise P...

Shortformer: Better Language Modeling using Shorter Inputs

December 31, 2020

91% Match

Ofir Press, Noah A. Smith, Mike Lewis

Computation and Language

Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to impro...

Current Limitations of Language Models: What You Need is Retrieval

September 15, 2020

91% Match

Aran Komatsuzaki

Computation and Language

Machine Learning

We classify and re-examine some of the current approaches to improve the performance-computes trade-off of language models, including (1) non-causal models (such as masked language models), (2) extension of batch length with efficient attention, (3) recurrence, (4) conditional computation and (5) retrieval. We identify some limitations (1) - (4) suffer from. For example, (1) currently struggles with open-ended text generation with the output loosely constrained by the input a...

Finetuning Pretrained Transformers into RNNs

March 24, 2021

91% Match

Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, ... , Noah A. Smith

Computation and Language

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heur...

Simple Recurrence Improves Masked Language Models

May 23, 2022

91% Match

Tao Lei, Ran Tian, ... , Ankur P. Parikh

Computation and Language

Artificial Intelligence

In this work, we explore whether modeling recurrence into the Transformer architecture can both be beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while keeping the number of parameters ...

Pushdown Layers: Encoding Recursive Structure in Transformer Language Models

October 29, 2023

91% Match

Shikhar Murty, Pratyusha Sharma, ... , Christopher D. Manning

Computation and Language

Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer language models poorly capture long-tail recursive structure and exhibit sample-inefficient syntactic generalization. This work introduces Pushdown Layers, a new self-attention layer that models recursive state via a stack tape that tracks estimated depths of every token in an incremen...

Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size
