Fine-tuning Encoders for Improved Monolingual and Zero-shot Polylingual Neural Topic Modeling

KDSTM: Neural Semi-supervised Topic Modeling with Knowledge Distillation

July 4, 2023

89% Match

Weijie Xu, Xiaoyu Jiang, Jay Desai, Bin Han, ... , Francis Iannacci

Computation and Language

Artificial Intelligence

In text classification tasks, fine-tuning pretrained language models like BERT and GPT-3 yields competitive accuracy; however, both methods require pretraining on large text datasets. In contrast, general topic modeling methods possess the advantage of analyzing documents to extract meaningful patterns of words without the need for pretraining. To leverage topic modeling's unsupervised insights extraction on text classification tasks, we develop the Knowledge Distillation Semi...

On Cross-Lingual Retrieval with Multilingual Text Encoders

December 21, 2021

89% Match

Robert Litschko, Ivan Vulić, ... , Goran Glavaš

Computation and Language

Information Retrieval

In this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level...

Understanding The Robustness of Self-supervised Learning Through Topic Modeling

February 2, 2022

89% Match

Zeping Luo, Shiyou Wu, Cindy Weng, ... , Rong Ge

Computation and Language

Machine Learning

Self-supervised learning has significantly improved the performance of many NLP tasks. However, how self-supervised learning discovers useful representations, and why it is better than traditional approaches such as probabilistic models, are still largely unknown. In this paper, we focus on the context of topic modeling and highlight a key advantage of self-supervised learning - when applied to data generated by topic models, self-supervised learning can be oblivious to the...

A Simple and Effective Method to Improve Zero-Shot Cross-Lingual Transfer Learning

October 18, 2022

89% Match

Kunbo Ding, Weijie Liu, Yuejian Fang, Weiquan Mao, Zhe Zhao, Tao Zhu, Haoyan Liu, ... , Yiren Chen

Computation and Language

Existing zero-shot cross-lingual transfer methods rely on parallel corpora or bilingual dictionaries, which are expensive and impractical for low-resource languages. To disengage from these dependencies, researchers have explored training multilingual models on English-only resources and transferring them to low-resource languages. However, its effect is limited by the gap between embedding clusters of different languages. To address this issue, we propose Embedding-Push, Att...

Multi-source Neural Topic Modeling in Multi-view Embedding Spaces

April 17, 2021

89% Match

Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze

Computation and Language

Artificial Intelligence

Machine Learning

Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address data sparsity in short-text or small collection of documents. This work presents a novel neural topic modeling framework using multi-view embedding spaces: (1) pretrained topic-embeddings, and (2) pretrained word-embeddings (context insensitive from Glove and context-sensitive from BERT models) jointly from one...

ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling

January 4, 2022

89% Match

Alexandre Alcoforado, Thomas Palmeira Ferraz, Rodrigo Gerber, Enzo Bustos, André Seidel Oliveira, Bruno Miguel Veloso, ... , Anna Helena Reali Costa

Computation and Language

Artificial Intelligence

Machine Learning

Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods that assume low data availability in natural language processing. Among them, zero-shot learning stands out, which consists of learning a classifier without any previously labeled data. The best results reported with this approa...

Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification

June 8, 2022

89% Match

Stratos Xenouleas, Alexia Tsoukara, Giannis Panagiotakis, ... , Ion Androutsopoulos

Computation and Language

We consider zero-shot cross-lingual transfer in legal topic classification using the recent MultiEURLEX dataset. Since the original dataset contains parallel documents, which is unrealistic for zero-shot cross-lingual transfer, we develop a new version of the dataset without parallel documents. We use it to show that translation-based methods vastly outperform cross-lingual fine-tuning of multilingually pre-trained models, the best previous zero-shot transfer method for Multi...

A Primer on Pretrained Multilingual Language Models

July 1, 2021

89% Match

Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, ... , Pratyush Kumar

Computation and Language

Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, there has emerged a large body of work in (i) building bigger MLLMs covering a large number of languages (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs (iii) analysing the performance of...

Polyglot Prompt: Multilingual Multitask PrompTraining

April 29, 2022

89% Match

Jinlan Fu, See-Kiong Ng, Pengfei Liu

Computation and Language

This paper aims for a potential architectural improvement for multilingual learning and asks: Can different tasks from different languages be modeled in a monolithic framework, i.e. without any task/language-specific module? The benefit of achieving this could open new doors for future multilingual research, including allowing systems trained on low resources to be further assisted by other languages as well as other tasks. We approach this goal by developing a learning frame...

Multi-view and Multi-source Transfers in Neural Topic Modeling with Pretrained Topic and Word Embeddings

September 14, 2019

89% Match

Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze

Computation and Language

Information Retrieval

Machine Learning

Though word embeddings and topics are complementary representations, several past works have only used pre-trained word embeddings in (neural) topic modeling to address data sparsity problem in short text or small collection of documents. However, no prior work has employed (pre-trained latent) topics in transfer learning paradigm. In this paper, we propose an approach to (1) perform knowledge transfer using latent topics obtained from a large source corpus, and (2) jointly t...
