The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

A Family of Pretrained Transformer Language Models for Russian

September 19, 2023

93% Match

Dmitry Zmitrovich, Alexander Abramov, Andrey Kalmykov, Maria Tikhonova, Ekaterina Taktasheva, Danil Astafurov, Mark Baushenko, Artem Snegirev, Tatiana Shavrina, Sergey Markov, ..., Alena Fenogenova

Computation and Language

Nowadays, Transformer language models (LMs) represent a fundamental component of the NLP research methodologies and applications. However, the development of such models specifically for the Russian language has received little attention. This paper presents a collection of 13 Russian Transformer LMs based on the encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder (ruT5, FRED-T5) models in multiple sizes. Access to these models is readily available ...

MTEB: Massive Text Embedding Benchmark

October 13, 2022

92% Match

Niklas Muennighoff, Nouamane Tazi, ..., Nils Reimers

Computation and Language

Information Retrieval

Machine Learning

Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Te...

Evaluation of Morphological Embeddings for the Russian Language

March 11, 2021

92% Match

Vitaly Romanov, Albina Khusainova

Computation and Language

Machine Learning

A number of morphology-based word embedding models were introduced in recent years. However, their evaluation was mostly limited to English, which is known to be a morphologically simple language. In this paper, we explore whether and to what extent incorporating morphology into word embeddings improves performance on downstream NLP tasks, in the case of morphologically rich Russian language. NLP tasks of our choice are POS tagging, Chunking, and NER -- for Russian language, ...

Sentence Embeddings for Russian NLU

October 29, 2019

92% Match

Dmitry Popov, Alexander Pugachev, Polina Svyatokum, ..., Ekaterina Artemova

Computation and Language

Machine Learning

We investigate the performance of sentence embeddings models on several tasks for the Russian language. In our comparison, we include such tasks as multiple choice question answering, next sentence prediction, and paraphrase identification. We employ FastText embeddings as a baseline and compare it to ELMo and BERT embeddings. We conduct two series of experiments, using both unsupervised (i.e., based on similarity measure only) and supervised approaches for the tasks. Finally...

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

October 29, 2020

91% Match

Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, ..., Andrey Evlampiev

Computation and Language

Artificial Intelligence

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills - detection of natural language inference, commonsense reasoning, ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benc...

Texts in, meaning out: neural language models in semantic similarity task for Russian

April 30, 2015

91% Match

Andrey Kutuzov, Igor Andreev

Computation and Language

Distributed vector representations for natural language vocabulary get a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of Russian Semantic Similarity Evaluation track, where our models took from the 2nd to the 5th position, depending on the task. We introduce the tools and corpo...

Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models

February 15, 2022

91% Match

Alena Fenogenova, Maria Tikhonova, Vladislav Mikhailov, Tatiana Shavrina, Anton Emelyanov, Denis Shevelev, Alexandr Kukushkin, ..., Ekaterina Artemova

Computation and Language

Artificial Intelligence

In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which led to performance evaluation problems across a range of language understanding tasks. This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user experience and methodological improvements, including fixes of the benchmark vulnerabilities unresolved in the prev...

RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining

April 8, 2022

91% Match

Alexander Yalunin, Alexander Nesterov, Dmitriy Umerenkov

Computation and Language

Artificial Intelligence

This paper presents several BERT-based models for Russian language biomedical text mining (RuBioBERT, RuBioRoBERTa). The models are pre-trained on a corpus of freely available texts in the Russian biomedical domain. With this pre-training, our models demonstrate state-of-the-art results on RuMedBench - Russian medical language understanding benchmark that covers a diverse set of tasks, including text classification, question answering, natural language inference, and named en...

Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language

May 17, 2019

91% Match

Yuri Kuratov, Mikhail Arkhipov

Computation and Language

The paper introduces methods of adaptation of multilingual masked language models for a specific language. Pre-trained bidirectional language models show state-of-the-art performance on a wide range of tasks including reading comprehension, natural language inference, and sentiment analysis. At the moment there are two alternative approaches to train such models: monolingual and multilingual. While language specific models show superior performance, multilingual models allow ...

Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark

May 27, 2024

91% Match

Hongliu Cao

Information Retrieval

Artificial Intelligence

Computation and Language

Text embedding methods have become increasingly popular in both industrial and academic fields due to their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further highlighted with the rise of Large Language Models (LLMs) applications such as Retrieval-Augmented Systems (RAGs). While previous models have attempted to be general-purpose, they often struggle to generalize across tasks and domains. However, ...
