The Russian-focused embedders' explorati...

ELMo and BERT in semantic change detection for Russian

October 7, 2020

89% Match

Julia Rodina, Yuliya Trofimova, ... , Ekaterina Artemova

Computation and Language

We study the effectiveness of contextualized embeddings for the task of diachronic semantic change detection for Russian language data. Evaluation test sets consist of Russian nouns and adjectives annotated based on their occurrences in texts created in pre-Soviet, Soviet and post-Soviet time periods. ELMo and BERT architectures are compared on the task of ranking Russian words according to the degree of their semantic change over time. We use several methods for aggregation ...

RUSSE: The First Workshop on Russian Semantic Similarity

March 15, 2018

89% Match

Alexander Panchenko, Natalia Loukachevitch, Dmitry Ustalov, Denis Paperno, ... , Natalia Konstantinova

Computation and Language

The paper gives an overview of the Russian Semantic Similarity Evaluation (RUSSE) shared task held in conjunction with the Dialogue 2015 conference. There exist a lot of comparative studies on semantic similarity, yet no analysis of such measures was ever performed for the Russian language. Exploring this problem for the Russian language is even more interesting, because this language has features, such as rich morphology and free word order, which make it significantly diffe...

MINERS: Multilingual Language Models as Semantic Retrievers

June 11, 2024

89% Match

Genta Indra Winata, Ruochen Zhang, David Ifeoluwa Adelani

Computation and Language

Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models' representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces the MINERS, a benchmark designed to evaluate th...

Russian word sense induction by clustering averaged word embeddings

May 6, 2018

89% Match

Andrey Kutuzov

Computation and Language

The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It implied representing contexts of ambiguous words as averaged word embedding vectors, using off-the-she...

Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging

October 19, 2024

89% Match

Mingxin Li, Zhijie Nie, Yanzhao Zhang, Dingkun Long, ... , Pengjun Xie

Computation and Language

Text embeddings are vital for tasks such as text retrieval and semantic textual similarity (STS). Recently, the advent of pretrained language models, along with unified benchmarks like the Massive Text Embedding Benchmark (MTEB), has facilitated the development of versatile general-purpose text embedding models. Advanced embedding models are typically developed using large-scale multi-task data and joint training across multiple tasks. However, our experimental analysis revea...

Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

January 19, 2018

89% Match

Andrey Kutuzov, Maria Kunilovskaya

Computation and Language

In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to the model trained on the Russian National Corpus (RNC). The two corpora are much different in their size and compilation procedures. We test these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similar...

MMTEB: Massive Multilingual Text Embedding Benchmark

February 19, 2025

89% Match

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, ... , Niklas Muennighoff

Computation and Language

Artificial Intelligence

Information Retrieval

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instructio...

RuSentNE-2023: Evaluating Entity-Oriented Sentiment Analysis on Russian News Texts

May 28, 2023

89% Match

Anton Golubev, Nicolay Rusnachenko, Natalia Loukachevitch

Computation and Language

The paper describes the RuSentNE-2023 evaluation devoted to targeted sentiment analysis in Russian news texts. The task is to predict sentiment towards a named entity in a single sentence. The dataset for RuSentNE-2023 evaluation is based on the Russian news corpus RuSentNE having rich sentiment-related annotation. The corpus is annotated with named entities and sentiments towards these entities, along with related effects and emotional states. The evaluation was organized us...

RuBia: A Russian Language Bias Detection Dataset

March 26, 2024

89% Match

Veronika Grigoreva, Anastasiia Ivanova, ... , Ekaterina Artemova

Computation and Language

Warning: this work contains upsetting or disturbing content. Large language models (LLMs) tend to learn the social and cultural biases present in the raw pre-training data. To test if an LLM's behavior is fair, functional datasets are employed, and due to their purpose, these datasets are highly language and culture-specific. In this paper, we address a gap in the scope of multilingual bias evaluation by presenting a bias detection dataset specifically designed for the Russ...

MERA: A Comprehensive LLM Evaluation in Russian

January 9, 2024

89% Match

Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, ... , Sergei Markov

Computation and Language

Artificial Intelligence

Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). As the models' size increases, LMs demonstrate enhancements in measurable aspects and the development of new qualitative features. However, despite researchers' attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these is...

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
