The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design

RuSentEval: Linguistic Source, Encoder Force!

February 28, 2021

91% Match

Vladislav Mikhailov, Ekaterina Taktasheva, ... , Ekaterina Artemova

Computation and Language

The success of pre-trained transformer language models has brought a great deal of interest on how these models work, and what they learn about language. However, prior research in the field is mainly devoted to English, and little is known regarding other languages. To this end, we introduce RuSentEval, an enhanced set of 14 probing tasks for Russian, including ones that have not been explored yet. We apply a combination of complementary probing methods to explore the distri...

Improving Results on Russian Sentiment Datasets

July 28, 2020

90% Match

Anton Golubev, Natalia Loukachevitch

Computation and Language

In this study, we test standard neural network architectures (CNN, LSTM, BiLSTM) and recently appeared BERT architectures on previous Russian sentiment evaluation datasets. We compare two variants of Russian BERT and show that for all sentiment tasks in this study the conversational variant of Russian BERT performs better. The best results were achieved by BERT-NLI model, which treats sentiment classification tasks as a natural language inference task. On one of the datasets,...

Arctic-Embed 2.0: Multilingual Retrieval Without Compromise

December 3, 2024

90% Match

Puxuan Yu, Luke Merrick, ... , Daniel Campos

Computation and Language

Information Retrieval

Machine Learning

This paper presents the training methodology of Arctic-Embed 2.0, a set of open-source text embedding models built for accurate and efficient multilingual retrieval. While prior works have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for efficient embedding storage with significantly lower compressed quality degradatio...

FaMTEB: Massive Text Embedding Benchmark in Persian Language

February 17, 2025

90% Match

Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, ... , Arash Amini

Computation and Language

Information Retrieval

Machine Learning

In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are formed as a combination of existing, translated, and newly generated data, offering a diverse evaluation framework for Persian...

Long Input Benchmark for Russian Analysis

August 5, 2024

90% Match

Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, ... , Alena Fenogenova

Computation and Language

Artificial Intelligence

Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with long text documents and to process long sequences of tokens. This has created a demand for proper evaluation of long-context understanding. To address this need for the Russian language, we propose LIBRA (Long Input Benchmark for Russian Analy...

Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models

May 8, 2024

90% Match

Luke Merrick, Danmei Xu, ... , Daniel Campos

Computation and Language

Artificial Intelligence

Information Retrieval

This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l outperforming closed source embedding models such...

Evaluation of Morphological Embeddings for English and Russian Languages

March 11, 2021

90% Match

Vitaly Romanov, Albina Khusainova

Computation and Language

This paper evaluates morphology-based embeddings for English and Russian languages. Despite the interest and introduction of several morphology-based word embedding models in the past and acclaimed performance improvements on word similarity and language modeling tasks, in our experiments, we did not observe any stable preference over two of our baseline models - SkipGram and FastText. The performance exhibited by morphological embeddings is the average of the two baselines m...

The Limitations of Cross-language Word Embeddings Evaluation

June 6, 2018

90% Match

Amir Bakarov, Roman Suvorov, Ilya Sochenkov

Computation and Language

The aim of this work is to explore the possible limitations of existing methods of cross-language word embeddings evaluation, addressing the lack of correlation between intrinsic and extrinsic cross-language evaluation methods. To prove this hypothesis, we construct English-Russian datasets for extrinsic and intrinsic evaluation tasks and compare performances of 5 different cross-language models on them. The results say that the scores even on different intrinsic benchmarks d...

Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

February 26, 2024

90% Match

Isabelle Mohr, Markus Krimmel, Saba Sturua, Mohammad Kalim Akram, Andreas Koukounas, Michael Günther, Georgios Mastrapas, Vinit Ravishankar, Joan Fontanals Martínez, Feng Wang, Qi Liu, Ziniu Yu, Jie Fu, Saahil Ognawala, Susana Guzman, Bo Wang, Maximilian Werk, ... , Han Xiao

Computation and Language

Artificial Intelligence

Information Retrieval

We introduce a novel suite of state-of-the-art bilingual text embedding models that are designed to support English and another target language. These models are capable of processing lengthy text inputs with up to 8192 tokens, making them highly versatile for a range of natural language processing tasks such as text retrieval, clustering, and semantic textual similarity (STS) calculations. By focusing on bilingual models and introducing a unique multi-task learning objecti...

Lexicon-based Methods vs. BERT for Text Sentiment Analysis

November 19, 2021

90% Match

Anastasia Kotelnikova, Danil Paschenko, ... , Evgeny Kotelnikov

Computation and Language

The performance of sentiment analysis methods has greatly increased in recent years. This is due to the use of various models based on the Transformer architecture, in particular BERT. However, deep neural network models are difficult to train and poorly interpretable. An alternative approach is rule-based methods using sentiment lexicons. They are fast, require no training, and are well interpreted. But recently, due to the widespread use of deep learning, lexicon-based meth...

The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design
