The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

RUSSE'2020: Findings of the First Taxonomy Enrichment Task for the Russian language

May 22, 2020

Irina Nikishina, Varvara Logacheva, ..., Natalia Loukachevitch

Computation and Language

Artificial Intelligence

This paper describes the results of the first shared task on taxonomy enrichment for the Russian language. The participants were asked to extend an existing taxonomy with previously unseen words: for each new word their systems should provide a ranked list of possible (candidate) hypernyms. In comparison to the previous tasks for other languages, our competition has a more realistic task setting: new words were provided without definitions. Instead, we provided a textual corp...

Rotations and Interpretability of Word Embeddings: the Case of the Russian Language

July 15, 2017

Alexey Zobnin

Computation and Language

Consider a continuous word embedding model. Usually, the cosines between word vectors are used as a measure of similarity of words. These cosines do not change under orthogonal transformations of the embedding space. We demonstrate that, using some canonical orthogonal transformations from SVD, it is possible both to increase the meaning of some components and to make the components more stable under re-learning. We study the interpretability of components for publicly availa...

Improving Text Embeddings with Large Language Models

December 31, 2023

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, ..., Furu Wei

Computation and Language

Information Retrieval

In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task ...

BERT: A Review of Applications in Natural Language Processing and Understanding

March 22, 2021

M. V. Koroteev

Computation and Language

Artificial Intelligence

Machine Learning

In this review, we describe the application of one of the most popular deep learning-based language models - BERT. The paper describes the mechanism of operation of this model, the main areas of its application to the tasks of text analytics, comparisons with similar models in each task, as well as a description of some proprietary models. In preparing this review, the data of several dozen original scientific articles published over the past few years, which attracted the mo...

Automatically Ranked Russian Paraphrase Corpus for Text Generation

June 17, 2020

Vadim Gudkov, Olga Mitrofanova, Elizaveta Filippskikh

Computation and Language

The article is focused on automatic development and ranking of a large corpus for Russian paraphrase generation which proves to be the first corpus of such type in Russian computational linguistics. Existing manually annotated paraphrase datasets for Russian are limited to small-sized ParaPhraser corpus and ParaPlag which are suitable for a set of NLP tasks, such as paraphrase and plagiarism detection, sentence similarity and relatedness estimation, etc. Due to size restricti...

TAPE: Assessing Few-shot Russian Language Understanding

October 23, 2022

Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, ..., Vladislav Mikhailov

Computation and Language

Recent advances in zero-shot and few-shot learning have shown promise for a scope of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this line of research, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts...

The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding

June 4, 2024

Kenneth Enevoldsen, Márton Kardos, ..., Kristoffer Laigaard Nielbo

Computation and Language

Artificial Intelligence

The evaluation of English text embeddings has transitioned from evaluating a handful of datasets to broad coverage across many tasks through benchmarks such as MTEB. However, this is not the case for multilingual text embeddings due to a lack of available benchmarks. To address this problem, we introduce the Scandinavian Embedding Benchmark (SEB). SEB is a comprehensive framework that enables text embedding evaluation for Scandinavian languages across 24 tasks, 10 subtasks, a...

Facilitating large language model Russian adaptation with Learned Embedding Propagation

December 30, 2024

Mikhail Tikhomirov, Daniil Chernyshev

Computation and Language

Artificial Intelligence

Rapid advancements of large language model (LLM) technologies led to the introduction of powerful open-source instruction-tuned LLMs that have the same text generation quality as the state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, the authors of such models do not disclose the training data necessary for replication of the results, thus making the achievements mo...

Gibberish Semantics: How Good is Russian Twitter in Word Semantic Similarity Task?

February 28, 2016

Nikolay N. Vasiliev

Computation and Language

The most studied and most successful language models were developed and evaluated mainly for English and other close European languages, such as French, German, etc. It is important to study the applicability of these models to other languages. The use of vector space models for Russian was recently studied for multiple corpora, such as Wikipedia, RuWac, lib.ru. These models were evaluated on the word semantic similarity task. To our knowledge, Twitter was not considered as a co...

RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs

June 27, 2024

Ekaterina Taktasheva, Maxim Bazhukov, Kirill Koncha, Alena Fenogenova, ..., Vladislav Mikhailov

Computation and Language

Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast ...

