Adding Recurrence to Pretrained Transfor...

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

December 19, 2023

91% Match

Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, ... , Wang Fu Lee

Computation and Language

With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs), particularly the emergence of large language models (LLMs) with billions of parameters, many natural language processing (NLP) tasks have demonstrated remarkable success. However, the enormous size and computational demands of these models pose significant challenges for adapting them to specific downstream tasks, especially in environments with limited computational...

Find SimilarView on arXiv

Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

October 9, 2022

91% Match

Huanru Henry Mao

Machine Learning

Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast w...

Find SimilarView on arXiv

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

May 25, 2023

91% Match

Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, ... , Hofmann Thomas

Computation and Language

Machine Learning

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most of LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requireme...

Find SimilarView on arXiv

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

March 28, 2023

91% Match

Vladislav Lialin, Vijeta Deshpande, Anna Rumshisky

Computation and Language

This paper presents a systematic overview and comparison of parameter-efficient fine-tuning methods covering over 40 papers published between February 2019 and February 2023. These methods aim to resolve the infeasibility and impracticality of fine-tuning large language models by only training a small set of parameters. We provide a taxonomy that covers a broad range of methods and present a detailed method comparison with a specific focus on real-life efficiency and fine-tun...

Find SimilarView on arXiv

Block Transformer: Global-to-Local Language Modeling for Fast Inference

June 4, 2024

91% Match

Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, ... , Yun Se-Young

Computation and Language

Artificial Intelligence

Machine Learning

This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global co...

Find SimilarView on arXiv

Trainable Transformer in Transformer

July 3, 2023

90% Match

Abhishek Panigrahi, Sadhika Malladi, ... , Arora Sanjeev

Computation and Language

Machine Learning

Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate a...

Find SimilarView on arXiv

GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning

July 5, 2024

90% Match

Aleksander Ficek, Jiaqi Zeng, Oleksii Kuchaiev

Computation and Language

Artificial Intelligence

Information Retrieval

Machine Learning

Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation (RAG) have become popular methods for adapting large language models while minimizing compute requirements. In this paper, we apply PEFT methods (P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes, ranging from 823 million to 48 billion parameters. We show that RETRO models outperform GPT models in zero-shot settings due to th...

Find SimilarView on arXiv

TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling

April 4, 2021

90% Match

Tze Yuang Chong, Xuyang Wang, ... , Wang Junjie

Computation and Language

In this paper, we describe the use of recurrent neural networks to capture sequential information from the self-attention representations to improve the Transformers. Although self-attention mechanism provides a means to exploit long context, the sequential information, i.e. the arrangement of tokens, is not explicitly captured. We propose to cascade the recurrent neural networks to the Transformers, which referred to as the TransfoRNN model, to capture the sequential informa...

Find SimilarView on arXiv

Addressing Some Limitations of Transformers with Feedback Memory

February 21, 2020

90% Match

Angela Fan, Thibaut Lavril, Edouard Grave, ... , Sukhbaatar Sainbayar

Machine Learning

Computation and Language

Machine Learning

Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower laye...

Find SimilarView on arXiv

Efficient Transformers: A Survey

September 14, 2020

90% Match

Yi Tay, Mostafa Dehghani, ... , Metzler Donald

Machine Learning

Artificial Intelligence

Computation and Language

Computer Vision and Pattern ...

Information Retrieval

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of "X-former" models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Tran...

Find SimilarView on arXiv

Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

Block Transformer: Global-to-Local Language Modeling for Fast Inference

Trainable Transformer in Transformer

GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning

TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling

Addressing Some Limitations of Transformers with Feedback Memory

Efficient Transformers: A Survey