February 27, 2020
Similar papers 2
March 23, 2021
In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large-size models have achie...
October 9, 2019
Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered ...
April 30, 2024
The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal me...
August 21, 2019
BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose the methodology and carry out a qualitative and quantitative analysis of the information enc...
October 24, 2019
Large, self-supervised transformer-based language representation models have recently received significant amounts of attention, and have produced state-of-the-art results across a variety of tasks simply by scaling up pre-training on larger and larger corpora. Such models usually produce high dimensional vectors, on top of which additional task-specific layers and architectural modifications are added to adapt them to specific downstream tasks. Though there exists ample evid...
January 25, 2020
Although Bidirectional Encoder Representations from Transformers (BERT) has achieved tremendous success in many natural language processing (NLP) tasks, it remains a black box. A variety of previous works have tried to lift the veil of BERT and understand each layer's functionality. In this paper, we found that surprisingly the output layer of BERT can reconstruct the input sentence by directly taking each layer of BERT as input, even though the output layer has never seen t...
November 3, 2023
This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the strict and strict-small tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally...
July 16, 2023
Recent years have seen a phenomenal rise in performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformers (BERT), Generative Pretrained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of ordinary people. However,...
March 27, 2021
Natural Language Processing (NLP) has evolved significantly over the last decade. This paper highlights the most important milestones of this period while trying to pinpoint the contribution of each individual model and algorithm to the overall progress. Furthermore, it focuses on issues still remaining to be solved, emphasizing the groundbreaking proposals of Transformers, BERT, and all the similar attention-based models.
September 15, 2020
One of the challenges in the NLP field is training large classification models, a task that is both difficult and tedious. It is even harder when GPU hardware is unavailable. The increased availability of pre-trained and off-the-shelf word embeddings, models, and modules aims at easing the process of training large models and achieving a competitive performance. We explore the use of off-the-shelf BERT models and share the results of our experiments and compare their results t...