ID: 2404.02043

Ukrainian Texts Classification: Exploration of Cross-lingual Knowledge Transfer Approaches

April 2, 2024

Daryna Dementieva, Valeriia Khylenko, Georg Groh
Computer Science
Computation and Language
Artificial Intelligence

Despite the extensive amount of labeled datasets in the NLP text classification field, a persistent imbalance in data availability across languages remains evident. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test these approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing a "recipe" for the optimal setups.
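As a minimal sketch of the first of these setups, the snippet below fine-tunes a multilingual encoder (XLM-R is used as a stand-in; the toy dataset and hyperparameters are placeholders, not the paper's configuration) on English labels and then classifies Ukrainian text zero-shot through the shared multilingual representation space.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "xlm-roberta-base"  # any large multilingual encoder can stand in here

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy English training data; in the transfer setting, no Ukrainian labels exist.
english_train = Dataset.from_dict({
    "text": ["you are awful", "have a nice day"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="uk-clf", num_train_epochs=3),
    train_dataset=english_train.map(tokenize, batched=True),
)
trainer.train()

# Zero-shot inference on Ukrainian input via the shared multilingual space.
inputs = tokenizer("Це образливий коментар.", return_tensors="pt").to(model.device)
print(model(**inputs).logits.argmax(-1))
```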

Similar papers

Toxicity Classification in Ukrainian

April 27, 2024

96% Match
Daryna Dementieva, Valeriia Khylenko, ... , Georg Groh
Computation and Language

Toxicity detection remains a relevant task, especially in the context of safe and fair LM development. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, no toxicity classification corpus previously existed for Ukrainian. In this study, ...


SmurfCat at PAN 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification

July 7, 2024

90% Match
Elisei Rykov, Konstantin Zaytsev, ... , Alexandr Voronin
Computation and Language
Artificial Intelligence

This paper presents a solution for the Multilingual Text Detoxification task in the PAN-2024 competition of the SmurfCat team. Using data augmentation through machine translation and a special filtering procedure, we collected an additional multilingual parallel dataset for text detoxification. Using the obtained data, we fine-tuned several multilingual sequence-to-sequence models, such as mT0 and Aya, on a text detoxification task. We applied the ORPO alignment technique to ...
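For illustration, here is a minimal sketch of the basic fine-tuning setup such a system builds on: one training step of a multilingual seq2seq model (mT0-small as a stand-in) on a toy toxic-to-neutral pair. The prompt format and the pair are invented placeholders, and the ORPO alignment stage is omitted.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/mt0-small")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-small")

# Toy parallel pair (toxic source -> neutral target); real training would use
# the collected multilingual parallel detoxification corpus.
src = tok("Detoxify: you are a complete idiot", return_tensors="pt")
tgt = tok("I disagree with you", return_tensors="pt").input_ids

loss = model(**src, labels=tgt).loss  # standard seq2seq cross-entropy
loss.backward()
```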


From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation

April 14, 2024

90% Match
Artur Kiulian, Anton Polishko, Mykola Khandoga, Oryna Chubych, Jack Connor, ... , Adarsh Shirawalmath
Computation and Language
Artificial Intelligence
Machine Learning

In the rapidly advancing field of AI and NLP, generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology. Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve th...
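A common recipe for this kind of adaptation is parameter-efficient fine-tuning; the sketch below shows LoRA applied to Mistral on a single Ukrainian sentence. All hyperparameters and the training example are illustrative assumptions, not the paper's actual configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# LoRA adapters keep the base weights frozen; hyperparameters are illustrative.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)

batch = tok("Київ - столиця України.", return_tensors="pt")
loss = model(**batch, labels=batch.input_ids).loss  # causal LM objective
loss.backward()
```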


Universal Cross-Lingual Text Classification

June 16, 2024

89% Match
Riya Savant, Anushka Shelke, Sakshi Todmal, Sanskruti Kanphade, ... , Raviraj Joshi
Computation and Language
Machine Learning

Text classification, an integral task in natural language processing, involves the automatic categorization of text into predefined classes. Creating supervised labeled datasets for low-resource languages poses a considerable challenge. Unlocking the language potential of low-resource languages requires robust datasets with supervised labels. However, such datasets are scarce, and the label space is often limited. In our pursuit to address this gap, we aim to optimize existin...


T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification

June 8, 2023

89% Match
Inigo Jauregi Unanue, Gholamreza Haffari, Massimo Piccardi
Computation and Language

Cross-lingual text classification leverages text classifiers trained in a high-resource language to perform text classification in other languages with no or minimal fine-tuning (zero/few-shot cross-lingual transfer). Nowadays, cross-lingual text classifiers are typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest. However, the performance of these models varies significantly across languages and classification tas...
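The underlying translate-and-test idea can be sketched in a few lines: translate the target-language input into the high-resource language, then apply a classifier trained there. The translator and classifier below are illustrative stand-ins, not the paper's components.

```python
from transformers import pipeline

translate = pipeline("translation", model="Helsinki-NLP/opus-mt-uk-en")
classify = pipeline("text-classification",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

uk_text = "Цей фільм просто чудовий!"
en_text = translate(uk_text)[0]["translation_text"]
print(classify(en_text))  # prediction in the high-resource label space
```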


Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

October 6, 2020

89% Match
Giannis Karamanolakis, Daniel Hsu, Luis Gravano
Computation and Language

Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method...
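In its simplest form, teacher-student transfer trains the student to match the teacher's soft label distribution on unlabeled target-language data. The toy sketch below uses random features and a linear student; the paper's sparse teacher and exact transfer recipe are not reproduced.

```python
import torch
import torch.nn.functional as F

# Toy student over fixed 128-d features of unlabeled target-language documents.
student = torch.nn.Linear(128, 2)
features = torch.randn(8, 128)
teacher_probs = torch.softmax(torch.randn(8, 2), dim=-1)  # teacher soft labels

# Student is trained to match the teacher's distribution (KL divergence).
loss = F.kl_div(F.log_softmax(student(features), dim=-1),
                teacher_probs, reduction="batchmean")
loss.backward()
```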


Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification

July 29, 2020

89% Match
Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, ... , Gerard de Melo
Computation and Language

In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made it much easier to achieve this. Still, subtle differences between languages may be neglected when doing so. To address this, we present a semi-supervised adversarial training process that minimizes the maximal loss for ...
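The core mechanic, training against a loss-maximizing perturbation of the inputs, can be sketched with a toy FGSM-style step; this is an assumption-laden simplification, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(64, 2)
x = torch.randn(16, 64, requires_grad=True)  # input representations
pseudo_labels = torch.randint(0, 2, (16,))   # self-learning pseudo-labels

loss = F.cross_entropy(model(x), pseudo_labels)
grad, = torch.autograd.grad(loss, x)

# Loss-maximizing (adversarial) perturbation, then a training step on it.
x_adv = x.detach() + 0.01 * grad.sign()
adv_loss = F.cross_entropy(model(x_adv), pseudo_labels)
adv_loss.backward()
```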


From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

March 6, 2024

89% Match
Luiza Pozzobon, Patrick Lewis, ... , Beyza Ermis
Computation and Language
Artificial Intelligence

To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it is crucial that our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhan...


Few-shot learning for automated content analysis: Efficient coding of arguments and claims in the debate on arms deliveries to Ukraine

December 28, 2023

88% Match
Jonas Rieger, Kostiantyn Yanchenko, Mattes Ruckdeschel, Gerret von Nordheim, ... , Gregor Wiedemann
Computation and Language
Machine Learning

Pre-trained language models (PLMs) based on transformer neural networks developed in the field of natural language processing (NLP) offer great opportunities to improve automatic content analysis in communication science, especially for the coding of complex semantic categories in large datasets via supervised machine learning. However, three characteristics have so far impeded the widespread adoption of these methods in the applying disciplines: the dominance of English language mod...


Smart Expert System: Large Language Models as Text Classifiers

May 17, 2024

88% Match
Zhiqiang Wang, Yiran Pang, Yanbin Lin
Computation and Language

Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers. The system simplifies the traditional text classification workflow, eliminating the need for extensive preprocessing and domain expertise. The performance of several LLMs, machine learning (ML) algorithms, and neural n...
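The basic pattern of using an LLM as a classifier is zero-shot prompting, as in the hedged sketch below; the model and prompt are illustrative placeholders, not the system's actual components.

```python
from transformers import pipeline

llm = pipeline("text2text-generation", model="google/flan-t5-base")

prompt = ("Classify the sentiment of this review as positive or negative.\n"
          "Review: The battery died after one day.\n"
          "Answer:")
print(llm(prompt)[0]["generated_text"])  # e.g. "negative"
```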
