ID: 2404.02043

Ukrainian Texts Classification: Exploration of Cross-lingual Knowledge Transfer Approaches

April 2, 2024

Daryna Dementieva, Valeriia Khylenko, Georg Groh
Computer Science
Computation and Language
Artificial Intelligence

Despite the large number of labeled datasets for text classification in NLP, data availability remains persistently imbalanced across languages. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methods. To our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks (toxicity classification, formality classification, and natural language inference) and provide a "recipe" for the optimal setups.

Similar papers

Toxicity Classification in Ukrainian

April 27, 2024

96% Match
Daryna Dementieva, Valeriia Khylenko, ... , Georg Groh
Computation and Language

Toxicity detection is still a relevant task, especially in the context of safe and fair LM development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, there has been no existing toxicity classification corpus in Ukrainian. In this study, ...

SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification

July 7, 2024

90% Match
Elisei Rykov, Konstantin Zaytsev, ... , Alexandr Voronin
Computation and Language
Artificial Intelligence

This paper presents the SmurfCat team's solution for the Multilingual Text Detoxification task in the PAN-2024 competition. Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification. Using the obtained data, we fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task. We applied the ORPO alignment technique to ...

From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation

April 14, 2024

90% Match
Artur Kiulian, Anton Polishko, Mykola Khandoga, Oryna Chubych, Jack Connor, ... , Adarsh Shirawalmath
Computation and Language
Artificial Intelligence
Machine Learning

In the rapidly advancing field of AI and NLP, generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology. Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve th...

Universal Cross-Lingual Text Classification

June 16, 2024

89% Match
Riya Savant, Anushka Shelke, Sakshi Todmal, Sanskruti Kanphade, ... , Raviraj Joshi
Computation and Language
Machine Learning

Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge. Unlocking the language potential of low-resource languages requires robust datasets with supervised labels. However, such datasets are scarce, and the label space is often limited. In our pursuit to address this gap, we aim to optimize existin...

T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification

June 8, 2023

89% Match
Inigo Jauregi Unanue, Gholamreza Haffari, Massimo Piccardi
Computation and Language

Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero/few-shot cross-lingual transfer). Nowadays, cross-lingual text classifiers are typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest. However, the performance of these models varies significantly across languages and classification tas...

Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

October 6, 2020

89% Match
Giannis Karamanolakis, Daniel Hsu, Luis Gravano
Computation and Language

Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method...

Text Classification in the LLM Era -- Where do we stand?

February 17, 2025

89% Match
Sowmya Vajjala, Shwetali Shimangaud
Computation and Language

Large Language Models revolutionized NLP and showed dramatic performance improvements across several tasks. In this paper, we investigated the role of such language models in text classification and how they compare with other approaches relying on smaller pre-trained language models. Considering 32 datasets spanning 8 languages, we compared zero-shot classification, few-shot fine-tuning and synthetic data based classifiers with classifiers built using the complete human labe...

Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification

July 29, 2020

89% Match
Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, ... , Gerard de Melo
Computation and Language

In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made it much easier to achieve this. Still, there may be subtle differences between languages that are neglected when doing so. To address this, we present a semi-supervised adversarial training process that minimizes the maximal loss for ...

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

March 6, 2024

89% Match
Luiza Pozzobon, Patrick Lewis, ... , Beyza Ermis
Computation and Language
Artificial Intelligence

To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it's crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhan...

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation

December 18, 2024

89% Match
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
Computation and Language

Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express higher harmful social biases and toxicity levels. Prior work has shown that finetuning on specialized datasets can mitigate this behavior, and doing so in English can transfer to other languages. In this work, we investigate the impact of different finetuning methods on the model's bias and toxicity, but also on its abi...
