The Russian-focused embedders' explorati...

Extending the Massive Text Embedding Benchmark to French

May 30, 2024

89% Match

Mathieu Ciancone, Imene Kerboua, ... , Wissam Siblini

Computation and Language

Information Retrieval

Machine Learning

In recent years, numerous embedding models have been made available and widely used for various NLP tasks. Choosing a model that performs well across several tasks in English has been greatly simplified by the Massive Text Embedding Benchmark (MTEB), but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. Not only do we gather 22 existing datasets in an easy-to-use interface, but we a...

A new approach to calculating BERTScore for automatic assessment of translation quality

March 10, 2022

89% Match

A. A. Vetrov, E. A. Gorn

Computation and Language

Artificial Intelligence

This study examines the applicability of the BERTScore metric to sentence-level translation quality assessment for the English -> Russian direction. Experiments were performed with a pre-trained Multilingual BERT as well as with a pair of Monolingual BERT models. To align the monolingual embeddings, an orthogonal transformation based on anchor tokens was used. It was demonstrated that such a transformation helps to prevent the mismatching issue, and shown that this approach gi...

PL-MTEB: Polish Massive Text Embedding Benchmark

May 16, 2024

89% Match

Rafał Poświata, Sławomir Dadas, Michał Perełkiewicz

Computation and Language

In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in Polish. The PL-MTEB consists of 28 diverse NLP tasks from 5 task types. We adapted the tasks from datasets previously used by the Polish NLP community. In addition, we created a new PLSC (Polish Library of Science Corpus) dataset consisting of titles and abstracts of scientific publications in Polish, which was used as the basis for two novel...

Transfer Learning for Improving Results on Russian Sentiment Datasets

July 6, 2021

89% Match

Anton Golubev, Natalia Loukachevitch

Computation and Language

In this study, we test a transfer learning approach on Russian sentiment benchmark datasets using an additional training sample created with a distant supervision technique. We compare several variants of combining the additional data with the benchmark training samples. The best results were achieved using a three-step approach of sequential training on general, thematic, and original training samples. For most datasets, the results were improved by more than 3% over the current state-of-the-art methods...

Monolingual and Cross-Lingual Knowledge Transfer for Topic Classification

June 13, 2023

89% Match

Dmitry Karpov, Mikhail Burtsev

Computation and Language

Artificial Intelligence

This article investigates knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of samples (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We have prepared this dataset from the "Yandex Que" raw data. By evaluating the RuQTopics-trained models on the six matching classes of the Russian MASSIVE subset, we have proved that the RuQTopics dataset is suitable for real-world conversational tasks...

Current Landscape of the Russian Sentiment Corpora

June 28, 2021

89% Match

Evgeny Kotelnikov

Computation and Language

Currently, there are more than a dozen Russian-language corpora for sentiment analysis, differing in the source of the texts, domain, size, number and ratio of sentiment classes, and annotation method. This work examines publicly available Russian-language corpora, presents their qualitative and quantitative characteristics, which make it possible to get an idea of the current landscape of the corpora for sentiment analysis. The ranking of corpora by annotation quality is pro...

Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for Russian

May 22, 2024

89% Match

Aleksandr Nikolich, Konstantin Korolev, Artem Shelmanov

Computation and Language

Artificial Intelligence

There has been a surge in the development of various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and reduced computational performance due to the disproportionate representation of tokens in the model's vocabulary. In this work, we address these issues and introduce Vikhr, a new state-of-the-art open-source instruction-tuned LLM designed specifically for the Russi...

Transformers for Headline Selection for Russian News Clusters

June 19, 2021

89% Match

Pavel Voropaev, Olga Sopilnyak

Computation and Language

In this paper, we explore various multilingual and Russian pre-trained transformer-based models for the Dialogue Evaluation 2021 shared task on headline selection. Our experiments show that the combined approach is superior to individual multilingual and monolingual models. We present an analysis of a number of ways to obtain sentence embeddings and learn a ranking model on top of them. We achieve the result of 87.28% and 86.60% accuracy for the public and private test sets r...

Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

May 3, 2021

89% Match

Tatyana Iazykova, Denis Kapelyushnik, ... , Andrey Kutuzov

Computation and Language

Leaderboards like SuperGLUE are seen as important incentives for the active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams, as well as their resources, to collaborate and solve a set of tasks for general language understanding. Their performance scores are often claimed to be close to or even higher than human performance. These results encouraged more thorough analys...

RuMedBench: A Russian Medical Language Understanding Benchmark

January 17, 2022

89% Match

Pavel Blinov, Arina Reshetnikova, Aleksandr Nesterov, ... , Vladimir Kokh

Computation and Language

Artificial Intelligence

Machine Learning

The paper describes an open Russian medical language understanding benchmark covering several task types (classification, question answering, natural language inference, named entity recognition) on a number of novel text sets. Given the sensitive nature of healthcare data, such a benchmark partially addresses the lack of Russian medical datasets. We prepare a unified labeling format, data split, and evaluation metrics for the new tasks. The remaining tasks are f...

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
