Universal Cross-Lingual Text Classificat...

MultiFiT: Efficient Multi-lingual Language Model Fine-tuning

September 10, 2019

Julian Martin Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kardas, ... , Jeremy Howard

Computation and Language

Machine Learning

Pretrained language models are promising particularly for low-resource languages as they only require unlabelled data. However, training existing models requires huge amounts of compute, while pretrained cross-lingual models often underperform on low-resource languages. We propose Multi-lingual language model Fine-Tuning (MultiFiT) to enable practitioners to train and fine-tune language models efficiently in their own language. In addition, we propose a zero-shot method using...

WC-SBERT: Zero-Shot Text Classification via SBERT with Self-Training for Wikipedia Categories

July 28, 2023

Te-Yu Chi, Yu-Meng Tang, Chia-Wen Lu, ... , Jyh-Shing Roger Jang

Computation and Language

Artificial Intelligence

Our research focuses on solving the zero-shot text classification problem in NLP, with a particular emphasis on innovative self-training strategies. To achieve this objective, we propose a novel self-training strategy that uses labels rather than text for training, significantly reducing the model's training time. Specifically, we use categories from Wikipedia as our training set and leverage the SBERT pre-trained model to establish positive correlations between pairs of cate...

L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT

April 22, 2023

Samruddhi Deode, Janhavi Gadre, Aditi Kajale, ... , Raviraj Joshi

Computation and Language

Machine Learning

The multilingual Sentence-BERT (SBERT) models map different languages to a common representation space and are useful for cross-language similarity and mining tasks. We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using a synthetic corpus. We simply aggregate translated NLI or STS datasets of the low-resource target languages together and perform SBERT-like fine-tuning of the vanilla multilingual BERT m...

A Primer on Pretrained Multilingual Language Models

July 1, 2021

Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, ... , Pratyush Kumar

Computation and Language

Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, there has emerged a large body of work in (i) building bigger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of...

A Corpus for Multilingual Document Classification in Eight Languages

May 24, 2018

Holger Schwenk, Xian Li

Computation and Language

Cross-lingual document classification aims at training a document classifier on resources in one language and transferring it to a different language without any additional resources. Several approaches have been proposed in the literature and the current best practice is to evaluate them on a subset of the Reuters Corpus Volume 2. However, this subset covers only a few languages (English, German, French and Spanish) and almost all published works focus on the transfer betw...

Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph

September 9, 2021

Nuttapong Chairatanakul, Noppayut Sriwatanasakdi, Nontawat Charoenphakdee, ... , Tsuyoshi Murata

Computation and Language

Artificial Intelligence

Machine Learning

In cross-lingual text classification, it is required that task-specific training data in high-resource source languages are available, where the task is identical to that of a low-resource target language. However, collecting such training data can be infeasible because of the labeling cost, task characteristics, and privacy concerns. This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionari...

Cross-lingual Data Transformation and Combination for Text Classification

June 23, 2019

Jun Jiang, Shumao Pang, Xia Zhao, Liwei Wang, Andrew Wen, ... , Qianjin Feng

Information Retrieval

Computation and Language

Text classification is a fundamental task for text data mining. In order to train a generalizable model, a large volume of text must be collected. To address data insufficiency, cross-lingual data may occasionally be necessary. Cross-lingual data sources may however suffer from data incompatibility, as text written in different languages can hold distinct word sequences and semantic patterns. Machine translation and word embedding alignment provide an effective way to transfo...

Extending Multilingual BERT to Low-Resource Languages

April 28, 2020

Zihan Wang, Karthikeyan K, ... , Dan Roth

Computation and Language

Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning. However, this success has focused only on the top 104 languages in Wikipedia that it was trained on. In this paper, we propose a simple but effective approach to extend M-BERT (E-BERT) so that it can benefit any new language, and show that our approach benefits languages that are already in M-BERT as well. We perform an extensive set of experiments with Named E...

Combining Deep Generative Models and Multi-lingual Pretraining for Semi-supervised Document Classification

January 26, 2021

Yi Zhu, Ehsan Shareghi, Yingzhen Li, ... , Anna Korhonen

Computation and Language

Semi-supervised learning through deep generative models and multi-lingual pretraining techniques have orchestrated tremendous success across different areas of NLP. Nonetheless, their development has happened in isolation, while the combination of both could potentially be effective for tackling task-specific labelled data shortage. To bridge this gap, we combine semi-supervised deep generative models and multi-lingual pretraining to form a pipeline for document classificatio...

Revisiting Machine Translation for Cross-lingual Classification

May 23, 2023

Mikel Artetxe, Vedanuj Goswami, Shruti Bhosale, ... , Luke Zettlemoyer

Computation and Language

Artificial Intelligence

Machine Learning

Machine Translation (MT) has been widely used for cross-lingual classification, either by translating the test set into English and running inference with a monolingual model (translate-test), or translating the training set into the target languages and finetuning a multilingual model (translate-train). However, most research in the area focuses on the multilingual models rather than the MT component. We show that, by using a stronger MT system and mitigating the mismatch be...
