ID: 2306.07797

Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

June 13, 2023


Similar papers

RuSentEval: Linguistic Source, Encoder Force!

February 28, 2021

88% Match
Vladislav Mikhailov, Ekaterina Taktasheva, ... , Ekaterina Artemova
Computation and Language

The success of pre-trained transformer language models has brought a great deal of interest in how these models work and what they learn about language. However, prior research in the field is mainly devoted to English, and little is known about other languages. To this end, we introduce RuSentEval, an enhanced set of 14 probing tasks for Russian, including ones that have not been explored yet. We apply a combination of complementary probing methods to explore the distri...


Benchmarking Multilabel Topic Classification in the Kyrgyz Language

August 30, 2023

88% Match
Anton Alekseev, Sergey I. Nikolenko, Gulnara Kabaeva
Computation and Language

Kyrgyz is a severely underrepresented language in terms of modern natural language processing resources. In this work, we present a new public benchmark for topic classification in Kyrgyz, introducing a dataset based on collected and annotated data from the news site 24.KG and presenting several baseline models for news classification in the multilabel setting. We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and pr...


MFAQ: a Multilingual FAQ Dataset

September 27, 2021

88% Match
Maxime De Bruyn, Ehsan Lotfi, ... , Walter Daelemans
Computation and Language

In this paper, we present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model base...


Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

October 6, 2020

88% Match
Giannis Karamanolakis, Daniel Hsu, Luis Gravano
Computation and Language

Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method...


Analyzing Zero-shot Cross-lingual Transfer in Supervised NLP Tasks

January 26, 2021

88% Match
Hyunjin Choi, Judong Kim, Seongho Joe, ... , Youngjune Gwon
Computation and Language
Artificial Intelligence

In zero-shot cross-lingual transfer, a supervised NLP task trained on a corpus in one language is directly applicable to another language without any additional training. A source of cross-lingual transfer can be as straightforward as lexical overlap between languages (e.g., use of the same scripts, shared subwords) that naturally forces text embeddings to occupy a similar representation space. Recently introduced cross-lingual language model (XLM) pretraining brings out neur...


Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification

June 8, 2022

88% Match
Stratos Xenouleas, Alexia Tsoukara, Giannis Panagiotakis, ... , Ion Androutsopoulos
Computation and Language

We consider zero-shot cross-lingual transfer in legal topic classification using the recent MultiEURLEX dataset. Since the original dataset contains parallel documents, which is unrealistic for zero-shot cross-lingual transfer, we develop a new version of the dataset without parallel documents. We use it to show that translation-based methods vastly outperform cross-lingual fine-tuning of multilingually pre-trained models, the best previous zero-shot transfer method for Multi...


MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

September 2, 2021

87% Match
Ilias Chalkidis, Manos Fergadiotis, Ion Androutsopoulos
Computation and Language

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated into 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in ...


Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation

September 1, 2019

87% Match
Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa, Ankur Bapna, ... , Karthik Raman
Computation and Language

The recently proposed massively multilingual neural machine translation (NMT) system has been shown to be capable of translating over 100 languages to and from English within a single model. Its improved translation performance on low resource languages hints at potential cross-lingual transfer capability for downstream tasks. In this paper, we evaluate the cross-lingual effectiveness of representations from the encoder of a massively multilingual NMT model on 5 downstream cl...


An Empirical Study on Crosslingual Transfer in Probabilistic Topic Models

October 13, 2018

87% Match
Shudong Hao, Michael J. Paul
Computation and Language

Probabilistic topic modeling is a popular choice as the first step of crosslingual tasks, enabling knowledge transfer and the extraction of multilingual features. While many multilingual topic models have been developed, their assumptions about the training corpus vary widely, and it is not clear how well the models apply under different training conditions. In this paper, we systematically study the knowledge transfer mechanisms behind different multilingual topic models, and...


KDSTM: Neural Semi-supervised Topic Modeling with Knowledge Distillation

July 4, 2023

87% Match
Weijie Xu, Xiaoyu Jiang, Jay Desai, Bin Han, ... , Francis Iannacci
Computation and Language
Artificial Intelligence

In text classification tasks, fine-tuning pretrained language models like BERT and GPT-3 yields competitive accuracy; however, both methods require pretraining on large text datasets. In contrast, general topic modeling methods have the advantage of analyzing documents to extract meaningful patterns of words without the need for pretraining. To leverage topic modeling's unsupervised insight extraction for text classification tasks, we develop the Knowledge Distillation Semi...
