Fine-tuning Encoders for Improved Monolingual and Zero-shot Polylingual Neural Topic Modeling

April 11, 2021

Aaron Mueller, Mark Dredze

Computer Science

Computation and Language

Neural topic models can augment or replace bag-of-words inputs with the learned representations of deep pre-trained transformer-based word prediction models. One added benefit when using representations from multilingual models is that they facilitate zero-shot polylingual topic modeling. However, while it has been widely observed that pre-trained embeddings should be fine-tuned to a given task, it is not immediately clear what supervision should look like for an unsupervised task such as topic modeling. Thus, we propose several methods for fine-tuning encoders to improve both monolingual and zero-shot polylingual neural topic modeling. We consider fine-tuning on auxiliary tasks, constructing a new topic classification task, integrating the topic classification objective directly into topic model training, and continued pre-training. We find that fine-tuning encoder representations on topic classification and integrating the topic classification task directly into topic modeling both improve topic quality, and that fine-tuning encoder representations on any task is the most important factor for facilitating cross-lingual transfer.

Topic Modeling with Fine-tuning LLMs and Bag of Sentences

August 6, 2024

92% Match

Johannes Schneider

Computation and Language

Machine Learning

Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable (labeled) dataset for fine-tuning. In this paper, we adopt the recent idea of using a bag of sentences as the elementary unit in computing topics. In turn, we derive an approach...


Cross-lingual Contextualized Topic Models with Zero-shot Learning

April 16, 2020

91% Match

Federico Bianchi, Silvia Terragni, Dirk Hovy, ... , Elisabetta Fersini

Computation and Language

Many data sets (e.g., reviews, forums, news, etc.) exist in parallel in multiple languages. They all cover the same content, but the linguistic differences make it impossible to use traditional, bag-of-words-based topic models. Models have to be either single-language or suffer from a huge, but extremely sparse vocabulary. Both issues can be addressed by transfer learning. In this paper, we introduce a zero-shot cross-lingual topic model. Our model learns topics on one language...


Can Monolingual Pretrained Models Help Cross-Lingual Classification?

November 10, 2019

91% Match

Zewen Chi, Li Dong, Furu Wei, ... , Heyan Huang

Computation and Language

Multilingual pretrained language models (such as multilingual BERT) have achieved impressive results for cross-lingual transfer. However, due to the constant model capacity, multilingual pre-training usually lags behind the monolingual competitors. In this work, we present two approaches to improve zero-shot cross-lingual classification, by transferring the knowledge from monolingual pretrained models to multilingual ones. Experimental results on two cross-lingual classificat...


Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence

April 8, 2020

91% Match

Federico Bianchi, Silvia Terragni, Dirk Hovy

Computation and Language

Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural t...


Improving Neural Topic Models using Knowledge Distillation

October 5, 2020

91% Match

Alexander Hoyle, Pranav Goel, Philip Resnik

Computation and Language

Information Retrieval

Machine Learning

Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two models having disparate architectures, obtaining state-of-the-art topic coherence. We show that our adapta...


BERTopic: Neural topic modeling with a class-based TF-IDF procedure

March 11, 2022

90% Match

Maarten Grootendorst

Computation and Language

Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings...


Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning

April 29, 2020

90% Match

Zihan Liu, Genta Indra Winata, ... , Pascale Fung

Computation and Language

Machine Learning

Recently, fine-tuning pre-trained language models (e.g., multilingual BERT) to downstream cross-lingual tasks has shown promising results. However, the fine-tuning process inevitably changes the parameters of the pre-trained model and weakens its cross-lingual ability, which leads to sub-optimal performance. To alleviate this problem, we leverage continual learning to preserve the original cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks...


An Empirical Study on Crosslingual Transfer in Probabilistic Topic Models

October 13, 2018

90% Match

Shudong Hao, Michael J. Paul

Computation and Language

Probabilistic topic modeling is a popular choice as the first step of crosslingual tasks to enable knowledge transfer and extract multilingual features. While many multilingual topic models have been developed, their assumptions on the training corpus are quite varied, and it is not clear how well the models can be applied under various training conditions. In this paper, we systematically study the knowledge transfer mechanisms behind different multilingual topic models, and...


How Do Multilingual Encoders Learn Cross-lingual Representation?

July 12, 2022

90% Match

Shijie Wu

Computation and Language

NLP systems typically require support for more than one language. As different languages have different amounts of supervision, cross-lingual transfer benefits languages with little to no training data by transferring from other languages. From an engineering perspective, multilingual NLP benefits development and maintenance by serving multiple languages with a single system. Both cross-lingual transfer and multilingual NLP rely on cross-lingual representations serving as the...


Probabilistic Topic Modelling with Transformer Representations

March 6, 2024

90% Match

Arik Reuter, Anton Thielmann, Christoph Weisser, ... , Thomas Kneib

Machine Learning

Computation and Language

Topic modelling was mostly dominated by Bayesian graphical models during the last decade. With the rise of transformers in Natural Language Processing, however, several successful models that rely on straightforward clustering approaches in transformer-based embedding spaces have emerged and consolidated the notion of topics as clusters of embedding vectors. We propose the Transformer-Representation Neural Topic Model (TNTM), which combines the benefits of topic representatio...
