ID: 2406.11028

Universal Cross-Lingual Text Classification

June 16, 2024

Riya Savant, Anushka Shelke, Sakshi Todmal, Sanskruti Kanphade, Ananya Joshi, Raviraj Joshi
Computer Science
Computation and Language
Machine Learning

Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Unlocking the potential of low-resource languages requires robust datasets with supervised labels, yet creating such datasets is a considerable challenge: they are scarce, and their label spaces are often limited. To address this gap, we aim to make better use of existing labels and datasets across languages. This research proposes a novel perspective on Universal Cross-Lingual Text Classification, leveraging a unified model across languages. Our approach blends supervised data from different languages during training to create a universal model; the supervised data for a target classification task may come from different languages covering different labels. The primary goal is to enhance label and language coverage, aiming for a label set that is the union of the labels from the individual languages. We use a strong multilingual SBERT as our base model, which makes this training strategy feasible. The strategy improves the adaptability and effectiveness of the model in cross-lingual transfer scenarios, where it can categorize text in languages not encountered during training. The paper thus delves into the intricacies of cross-lingual text classification, with a particular focus on its application to low-resource languages, exploring methodologies and implications for the development of a robust and adaptable universal cross-lingual model.
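
To make the blended-training idea above concrete, the sketch below pools labeled examples from several languages into one training set whose label space is the union of the per-language label sets, embeds them with a multilingual SBERT encoder, and fits a single classifier on top. The checkpoint name, the toy examples, and the frozen-encoder-plus-linear-probe setup are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch, assuming the sentence-transformers library; the checkpoint
# name and the tiny inline dataset are illustrative, not the paper's setup.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Labeled examples from different languages; each language covers only a
# subset of the overall label space.
train_data = [
    ("The team won the championship last night", "sports"),              # English
    ("Der neue Film startet nächste Woche im Kino", "entertainment"),     # German
    ("Les actions ont chuté après l'annonce des résultats", "business"),  # French
    ("सरकार ने नई शिक्षा नीति की घोषणा की", "politics"),                      # Hindi
]

label_set = sorted({label for _, label in train_data})    # union of all labels
label_to_id = {label: i for i, label in enumerate(label_set)}

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # multilingual SBERT

texts = [text for text, _ in train_data]
labels = [label_to_id[label] for _, label in train_data]

embeddings = encoder.encode(texts)                 # language-agnostic sentence vectors
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)

# Cross-lingual transfer: classify text in a language absent from training.
test_emb = encoder.encode(["El equipo ganó el partido anoche"])   # Spanish
print(label_set[clf.predict(test_emb)[0]])

In the paper the classification head would presumably be trained on much larger blended datasets, possibly fine-tuning the encoder as well; the frozen encoder with a linear probe is used here only to keep the sketch self-contained.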

Similar papers

Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

October 6, 2020

93% Match
Giannis Karamanolakis, Daniel Hsu, Luis Gravano
Computation and Language

Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method...
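
The snippet above is cut off before the method details, so the sketch below shows only the generic teacher-student pattern it alludes to: a classifier trained on source-language labels pseudo-labels unlabeled target-language text through a shared multilingual embedding space, and a student is then trained on the combined data. The embedding model and toy data are assumptions; this is not the paper's specific sparse-teacher construction.

# Generic teacher-student sketch (not the paper's sparse-teacher method).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder

# Labeled source-language (English) documents.
src_texts = ["great product, works perfectly", "terrible, it broke after one day"]
src_labels = [1, 0]

# Unlabeled target-language (Spanish) documents.
tgt_texts = ["me encanta este producto", "muy mala calidad, no lo recomiendo"]

# Teacher: trained only on source-language labels.
teacher = LogisticRegression().fit(encoder.encode(src_texts), src_labels)

# Teacher pseudo-labels the target documents via the shared multilingual space.
pseudo_labels = teacher.predict(encoder.encode(tgt_texts))

# Student: trained on source labels plus target pseudo-labels, so it sees
# target-language text during training without any manual target labels.
student = LogisticRegression().fit(
    encoder.encode(src_texts + tgt_texts),
    np.concatenate([src_labels, pseudo_labels]),
)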

Bridging the domain gap in cross-lingual document classification

September 16, 2019

92% Match
Guokun Lai, Barlas Oguz, ... , Veselin Stoyanov
Computation and Language

The scarcity of labeled training data often prohibits the internationalization of NLP models to multiple languages. Recent developments in cross-lingual understanding (XLU) have made progress in this area, trying to bridge the language barrier using language-universal representations. However, even if the language problem were resolved, models trained in one language would not transfer to another language perfectly due to the natural domain drift across languages and cultures. ...

Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification

July 29, 2020

92% Match
Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, ... , Gerard de Melo
Computation and Language

In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made this much easier to achieve. Still, there may be subtle differences between languages that are neglected when doing so. To address this, we present a semi-supervised adversarial training process that minimizes the maximal loss for ...
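
As a rough illustration of the min-max idea in the snippet above, the step below applies an FGSM-style worst-case perturbation to the input embeddings and then minimizes the perturbed loss. The toy model, the perturbation radius, and the single-step attack are assumptions; the authors' actual semi-supervised procedure is only partially visible here.

# Embedding-space adversarial training step (generic illustration, PyTorch).
import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    """Embedding + mean-pool + linear head; stands in for a multilingual encoder."""
    def __init__(self, vocab_size=1000, dim=64, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids, perturbation=None):
        emb = self.embed(token_ids)                   # (batch, seq, dim)
        if perturbation is not None:
            emb = emb + perturbation                  # inner maximization applied here
        return self.head(emb.mean(dim=1))

model = ToyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
epsilon = 1e-2                                        # perturbation radius (assumed)

token_ids = torch.randint(0, 1000, (8, 16))           # toy labeled batch
labels = torch.randint(0, 3, (8,))

# 1) Clean forward pass, keeping the gradient w.r.t. the input embeddings.
emb = model.embed(token_ids).detach().requires_grad_(True)
clean_loss = loss_fn(model.head(emb.mean(dim=1)), labels)
grad, = torch.autograd.grad(clean_loss, emb)

# 2) FGSM-style perturbation approximating the inner max, then minimize it.
delta = epsilon * grad.sign()
adv_loss = loss_fn(model(token_ids, perturbation=delta), labels)

optimizer.zero_grad()
adv_loss.backward()
optimizer.step()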

Multilingual and cross-lingual document classification: A meta-learning approach

January 27, 2021

92% Match
Niels van der Heijden, Helen Yannakoudakis, ... , Ekaterina Shutova
Computation and Language

The great majority of languages in the world are considered under-resourced for the successful application of deep learning methods. In this work, we propose a meta-learning approach to document classification in a limited-resource setting and demonstrate its effectiveness in two different settings: few-shot, cross-lingual adaptation to previously unseen languages; and multilingual joint training when limited target-language data is available during training. We conduct a syste...

Can Monolingual Pretrained Models Help Cross-Lingual Classification?

November 10, 2019

92% Match
Zewen Chi, Li Dong, Furu Wei, ... , Heyan Huang
Computation and Language

Multilingual pretrained language models (such as multilingual BERT) have achieved impressive results for cross-lingual transfer. However, because a fixed model capacity must be shared across many languages, multilingual pre-training usually lags behind its monolingual competitors. In this work, we present two approaches to improve zero-shot cross-lingual classification by transferring the knowledge from monolingual pretrained models to multilingual ones. Experimental results on two cross-lingual classificat...

Cross-lingual Text Classification with Heterogeneous Graph Neural Network

May 24, 2021

91% Match
Ziyun Wang, Xuan Liu, Peiji Yang, ... , Zhisheng Wang
Computation and Language

Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages, which is very useful for low-resource languages. Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks, but rarely consider factors beyond semantic similarity, causing performance degradation between some language pairs. In this paper we propose a simple yet effective method ...

Cross-lingual Distillation for Text Classification

May 5, 2017

91% Match
Ruochen Xu, Yiming Yang
Computation and Language

Cross-lingual text classification (CLTC) is the task of classifying documents written in different languages into the same taxonomy of categories. This paper presents a novel approach to CLTC that builds on model distillation, which adapts and extends a framework originally proposed for model compression. Using soft probabilistic predictions for the documents in a label-rich language as the (induced) supervisory labels in a parallel corpus of documents, we train classifiers su...
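
A minimal sketch of the distillation step described above, assuming a parallel corpus whose source side is scored by a teacher trained in the label-rich language; the student on the target side is fit to the teacher's soft predictions. The feature dimensions, linear models, and temperature are placeholders rather than the paper's actual architecture.

# Cross-lingual distillation over a parallel corpus (toy sketch, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_probs, temperature=2.0):
    # Distillation loss: the student matches the teacher's softened distribution.
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * log_p).sum(dim=-1).mean()

num_classes = 4
teacher = nn.Linear(128, num_classes)   # stands in for a trained source-language classifier
student = nn.Linear(128, num_classes)   # target-language classifier to be trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Parallel corpus: aligned (source, target) document representations.
src_features = torch.randn(32, 128)     # label-rich language side
tgt_features = torch.randn(32, 128)     # aligned translations

with torch.no_grad():                   # soft probabilistic "induced" labels
    teacher_probs = F.softmax(teacher(src_features) / 2.0, dim=-1)

loss = soft_cross_entropy(student(tgt_features), teacher_probs)
optimizer.zero_grad()
loss.backward()
optimizer.step()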

Expanding the Text Classification Toolbox with Cross-Lingual Embeddings

March 23, 2019

91% Match
Meryem M'hamdi, Robert West, Andreea Hossmann, ... , Claudiu Musat
Computation and Language

Most work in text classification and Natural Language Processing (NLP) focuses on English or a handful of other languages that have text corpora of hundreds of millions of words. This is creating a new version of the digital divide: the artificial intelligence (AI) divide. Transfer-based approaches, such as Cross-Lingual Text Classification (CLTC) - the task of categorizing texts written in different languages into a common taxonomy, are a promising solution to the emerging A...

Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

December 22, 2018

91% Match
Mozhi Zhang, Yoshinari Fujinuma, Jordan Boyd-Graber
Computation and Language
Machine Learning

Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointly training a character-based embedder and a word-based classifier. The embedder ...
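
A simplified sketch of the joint character-to-word architecture the snippet describes: a character-level embedder (here a BiLSTM, as an assumption) builds word representations that can be shared across related languages with similar orthography, and a word-level classifier consumes them. Dimensions and pooling choices are illustrative, not the exact CACO configuration.

# Character-based embedder feeding a word-based classifier (toy sketch, PyTorch).
import torch
import torch.nn as nn

class CharWordClassifier(nn.Module):
    def __init__(self, num_chars=128, char_dim=32, word_dim=64, num_classes=5):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        # Character-based embedder: BiLSTM over the characters of each word.
        self.char_lstm = nn.LSTM(char_dim, word_dim // 2, batch_first=True,
                                 bidirectional=True)
        # Word-based classifier over mean-pooled word representations.
        self.head = nn.Linear(word_dim, num_classes)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len) character ids for one document
        chars = self.char_embed(char_ids)             # (num_words, len, char_dim)
        _, (h, _) = self.char_lstm(chars)             # h: (2, num_words, word_dim // 2)
        words = torch.cat([h[0], h[1]], dim=-1)       # (num_words, word_dim)
        return self.head(words.mean(dim=0, keepdim=True))  # document logits

model = CharWordClassifier()
doc = torch.randint(0, 128, (10, 12))                 # 10 words, 12 characters each
print(model(doc).shape)                               # torch.Size([1, 5])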

Cross-Lingual Text Classification with Multilingual Distillation and Zero-Shot-Aware Training

February 28, 2022

91% Match
Ziqing Yang, Yiming Cui, ... , Shijin Wang
Computation and Language

Multilingual pre-trained language models (MPLMs) not only can handle tasks in different languages but also exhibit surprising zero-shot cross-lingual transferability. However, MPLMs usually cannot match the supervised performance of state-of-the-art monolingual pre-trained models on rich-resource languages. In this paper, we aim to improve the multilingual model's supervised and zero-shot performance simultaneously only with the resources from s...
