ai-forever / ru-en-RoSBERTa

huggingface.co
Total runs: 7.2K
24-hour runs: 234
7-day runs: 1.5K
30-day runs: -3.4K
Model's Last Updated: September 26 2024
feature-extraction

Model Card for ru-en-RoSBERTa

ru-en-RoSBERTa is a general text embedding model for Russian. The model is based on ruRoBERTa and fine-tuned with ~4M pairs of supervised, synthetic and unsupervised data in Russian and English. The tokenizer supports some English tokens from the RoBERTa tokenizer.

For more model details, please refer to our article (see the Citation section).

Usage

The model can be used as is with prefixes. It is recommended to use CLS pooling. The choice of prefix and pooling depends on the task.

We use the following basic rules to choose a prefix:

  • "search_query: " and "search_document: " prefixes are for answer or relevant paragraph retrieval
  • "classification: " prefix is for symmetric paraphrasing-related tasks (STS, NLI, Bitext Mining)
  • "clustering: " prefix is for any tasks that rely on thematic features (topic classification, title-body retrieval)
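The prefix rules above can be sketched as a small lookup helper. This is a minimal illustration, not part of the model or of any library: the names `PREFIXES` and `add_prefix` and the task keys ("query", "document", and so on) are assumptions introduced here; only the prefix strings themselves come from the list above.

```python
# Hypothetical helper (not part of the model's API): maps a task name to the
# prefix string recommended by the rules above.
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
    "classification": "classification: ",
    "clustering": "clustering: ",
}


def add_prefix(text: str, task: str) -> str:
    """Return `text` with the prefix for `task` prepended."""
    if task not in PREFIXES:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PREFIXES)}")
    return PREFIXES[task] + text


print(add_prefix("Сколько программистов нужно, чтобы вкрутить лампочку?", "query"))
# search_query: Сколько программистов нужно, чтобы вкрутить лампочку?
```

Keeping the prefixes in one place makes it harder to encode a query with the document prefix by accident, since the two sides of a retrieval pair use different prefixes.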

To better tailor the model to your needs, you can fine-tune it with relevant high-quality Russian and English datasets.

Below are examples of encoding texts with the Transformers and SentenceTransformers libraries.

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


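def pool(hidden_state, mask, pooling_method="cls"):
    """Reduce token embeddings (batch, seq, dim) to one vector per text."""
    if pooling_method == "mean":
        # Average over real tokens only; padding positions are zeroed by the mask.
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        # Use the embedding of the first (CLS) token.
        return hidden_state[:, 0]
    raise ValueError(f"Unknown pooling method: {pooling_method!r}")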

inputs = [
    # first text of each pair
    "classification: Он нам и <unk> не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",

    # second text of each pair (same order as above)
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ru-en-RoSBERTa")
model = AutoModel.from_pretrained("ai-forever/ru-en-RoSBERTa")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)
    
embeddings = pool(
    outputs.last_hidden_state, 
    tokenized_inputs["attention_mask"],
    pooling_method="cls" # or try "mean"
)

embeddings = F.normalize(embeddings, p=2, dim=1)

sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.4796873927116394, 0.9409002065658569, 0.7761015892028809]

SentenceTransformers

from sentence_transformers import SentenceTransformer


inputs = [
    # first text of each pair
    "classification: Он нам и <unk> не нужон ваш Интернет!",
    "clustering: В Ярославской области разрешили работу бань, но без посетителей",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",

    # second text of each pair (same order as above)
    "classification: What a time to be alive!",
    "clustering: Ярославским баням разрешили работать без посетителей",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.",
]

# loads model with CLS pooling
model = SentenceTransformer("ai-forever/ru-en-RoSBERTa")

# embeddings are normalized by default
embeddings = model.encode(inputs, convert_to_tensor=True)

sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.47968706488609314, 0.940900444984436, 0.7761018872261047]
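Both examples score pairs with a plain matrix product. That works as cosine similarity only because the embeddings are L2-normalized first. The sketch below shows the same idea on toy data using the standard library only: the 3-dimensional vectors are made up for illustration and are not model outputs, and the helper names (`l2_normalize`, `dot`, `rank_documents`) are assumptions introduced here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the per-row effect of L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs have unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query_vec, doc_vecs):
    """Return (document indices sorted by decreasing similarity, raw scores)."""
    q = l2_normalize(query_vec)
    scores = [dot(q, l2_normalize(d)) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy vectors standing in for real embeddings (made up for illustration).
query = [1.0, 2.0, 0.0]
docs = [[0.0, 0.0, 3.0], [2.0, 4.0, 0.0], [1.0, 0.0, 0.0]]

order, scores = rank_documents(query, docs)
print(order)
# [1, 2, 0]
```

The second document points in the same direction as the query, so it ranks first regardless of its larger magnitude; the first document is orthogonal to the query and ranks last.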

Citation

@misc{snegirev2024russianfocusedembeddersexplorationrumteb,
      title={The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design}, 
      author={Artem Snegirev and Maria Tikhonova and Anna Maksimova and Alena Fenogenova and Alexander Abramov},
      year={2024},
      eprint={2408.12503},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.12503}, 
}

Limitations

The model is designed to process texts in Russian; its quality on English texts is unknown. The maximum input length is 512 tokens.
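With `truncation=True`, as in the Transformers example, anything past the 512-token limit is silently dropped. One common workaround, which is an assumption here and not something this model card prescribes, is to split a long token sequence into overlapping windows and embed each window separately. The sketch below works on a plain list of token ids; the function name `split_into_windows` and the `stride` value are hypothetical, and in practice `max_len` would need to be smaller than 512 to leave room for the prefix and special tokens.

```python
def split_into_windows(token_ids, max_len=512, stride=256):
    """Split a token-id sequence into overlapping windows of at most max_len items."""
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("need max_len > 0 and 0 < stride <= max_len")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


# Dummy ids standing in for real tokenizer output.
windows = split_into_windows(list(range(1000)))
print([len(w) for w in windows])
# [512, 512, 488]
```

How the per-window embeddings are combined afterwards (for example, by keeping the best-scoring window per document) depends on the application.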


License

MIT: https://choosealicense.com/licenses/mit

ru-en-RoSBERTa on huggingface.co

The model is available on huggingface.co:

https://huggingface.co/ai-forever/ru-en-RoSBERTa
