A Primer in BERTology: What we know about how BERT works

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

October 11, 2018

Jacob Devlin, Ming-Wei Chang, ... , Kristina Toutanova

Computation and Language

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide ra...

Which *BERT? A Survey Organizing Contextualized Encoders

October 2, 2020

Patrick Xia, Shijie Wu, Benjamin Van Durme

Computation and Language

Machine Learning

Pretrained contextualized text encoders are now a staple of the NLP community. We present a survey on language representation learning with the aim of consolidating a series of shared lessons learned across a variety of recent efforts. While significant advancements continue at a rapid pace, we find that enough has now been discovered, in different directions, that we can begin to organize advances according to common themes. Through this organization, we highlight important ...

BERT: A Review of Applications in Natural Language Processing and Understanding

March 22, 2021

M. V. Koroteev

Computation and Language

Artificial Intelligence

Machine Learning

In this review, we describe the application of one of the most popular deep learning-based language models - BERT. The paper describes the mechanism of operation of this model, the main areas of its application to the tasks of text analytics, comparisons with similar models in each task, as well as a description of some proprietary models. In preparing this review, the data of several dozen original scientific articles published over the past few years, which attracted the mo...

Introduction to Transformers: an NLP Perspective

November 29, 2023

Tong Xiao, Jingbo Zhu

Computation and Language

Artificial Intelligence

Machine Learning

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into ...

Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey

November 1, 2021

Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, ... , Dan Roth

Computation and Language

Artificial Intelligence

Machine Learning

Large, pre-trained transformer-based language models such as BERT have drastically changed the Natural Language Processing (NLP) field. We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches. We also present approaches that use pre-trained language models to generate data for training augmentation or other purposes. We conclude with discussions on limitations and s...

How Does BERT Answer Questions? A Layer-Wise Analysis of Transformer Representations

September 11, 2019

Betty van Aken, Benjamin Winter, ... , Felix A. Gers

Computation and Language

Information Retrieval

Bidirectional Encoder Representations from Transformers (BERT) reach state-of-the-art results in a variety of Natural Language Processing tasks. However, understanding of their internal functioning is still insufficient and unsatisfactory. In order to better understand BERT and other Transformer-based models, we present a layer-wise analysis of BERT's hidden states. Unlike previous research, which mainly focuses on explaining Transformer models by their attention weights, we ...

What Does BERT Look At? An Analysis of BERT's Attention

June 11, 2019

Kevin Clark, Urvashi Khandelwal, ... , Christopher D. Manning

Computation and Language

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to...

TrimBERT: Tailoring BERT for Trade-offs

February 24, 2022

Sharath Nittur Sridhar, Anthony Sarah, Sairam Sundaresan

Computation and Language

Models based on BERT have been extremely successful in solving a variety of natural language processing (NLP) tasks. Unfortunately, many of these large models require a great deal of computational resources and/or time for pre-training and fine-tuning which limits wider adoptability. While self-attention layers have been well-studied, a strong justification for inclusion of the intermediate layers which follow them remains missing in the literature. In this work, we show that...

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

February 27, 2020

Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, ... , Marianne Winslett

Machine Learning

Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks. However, these models often have billions of parameters, and, thus, are too resource-hungry and computation-intensive to suit low-capability devices or applications with strict latency requirements. One potential remedy for this is model compression, which has attracted a lot of research attention. Here, we summarize the research in compressing ...

A Comprehensive Comparison of Pre-training Language Models

June 22, 2021

Tong Guo

Computation and Language

Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to a new state of the art. In this paper we explore the efficiency of various pre-trained language models. We pre-train a list of transformer-based models with the same amount of text and the same training steps. The experimental results show that the largest improvement over the original BERT comes from adding an RNN layer to capture more contextual information for short text ...
