ai-forever / FRIDA

Model's Last Updated: December 29 2024
feature-extraction

Model Card for FRIDA

FRIDA is a full-scale fine-tuned general-purpose text embedding model inspired by the T5 denoising architecture. It is based on the encoder part of the FRED-T5 model and continues the line of research on text embedding models (ruMTEB, ru-en-RoSBERTa). It was pre-trained on a Russian-English dataset and then fine-tuned for improved performance on the target tasks.

For more model details please refer to our technical report [TODO].

Usage

The model can be used as is, provided each input text starts with a task prefix. CLS pooling is recommended, although the best choice of prefix and pooling depends on the task.

We use the following basic rules to choose a prefix:

  • "search_query: " and "search_document: " prefixes are for retrieving an answer or a relevant paragraph
  • "paraphrase: " prefix is for symmetric paraphrase-related tasks (STS, paraphrase mining, deduplication)
  • "categorize: " prefix is for asymmetric matching of a document title and body (e.g. news, scientific papers, social posts)
  • "categorize_sentiment: " prefix is for any task that relies on sentiment features (e.g. hate, toxicity, emotion)
  • "categorize_topic: " prefix is for tasks where you need to group texts by topic
  • "categorize_entailment: " prefix is for textual entailment tasks (NLI)
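
The prefix rules above can be captured in a small lookup helper. This is only an illustrative sketch: the task labels (`retrieval_query`, `sts`, and so on) and the `add_prefix` function are hypothetical names, not part of the FRIDA release. Only the prefix strings themselves come from the list above.

```python
# Hypothetical task labels mapped to the prefixes listed above.
TASK_PREFIXES = {
    "retrieval_query": "search_query: ",
    "retrieval_document": "search_document: ",
    "sts": "paraphrase: ",
    "title_body": "categorize: ",
    "sentiment": "categorize_sentiment: ",
    "topic": "categorize_topic: ",
    "nli": "categorize_entailment: ",
}


def add_prefix(text: str, task: str) -> str:
    """Prepend the prefix for `task` (a hypothetical label) to `text`."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        raise ValueError(
            f"Unknown task {task!r}; expected one of {sorted(TASK_PREFIXES)}"
        ) from None
    return prefix + text


print(add_prefix("Женщину спасают врачи.", "nli"))
# categorize_entailment: Женщину спасают врачи.
```

Failing loudly on an unknown task avoids silently encoding a text without any prefix.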

To better tailor the model to your needs, you can fine-tune it with relevant high-quality Russian and English datasets.

Below are examples of encoding texts with the Transformers and SentenceTransformers libraries.

Transformers
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel


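def pool(hidden_state, mask, pooling_method="cls"):
    """Reduce per-token hidden states to one vector per input text."""
    if pooling_method == "mean":
        # Average over real tokens only; padding positions are zeroed by the mask.
        s = torch.sum(hidden_state * mask.unsqueeze(-1).float(), dim=1)
        d = mask.sum(dim=1, keepdim=True).float()
        return s / d
    elif pooling_method == "cls":
        # Use the hidden state of the first token.
        return hidden_state[:, 0]
    raise ValueError(f"Unknown pooling_method: {pooling_method!r}")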

inputs = [
    # first text of each pair
    "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
    "categorize_entailment: Женщину доставили в больницу, за ее жизнь сейчас борются врачи.",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # second text of each pair, in the same order
    "paraphrase: Ярославским баням разрешили работать без посетителей",
    "categorize_entailment: Женщину спасают врачи.",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование."
]

tokenizer = AutoTokenizer.from_pretrained("ai-forever/FRIDA")
model = T5EncoderModel.from_pretrained("ai-forever/FRIDA")

tokenized_inputs = tokenizer(inputs, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokenized_inputs)
    
embeddings = pool(
    outputs.last_hidden_state, 
    tokenized_inputs["attention_mask"],
    pooling_method="cls" # or try "mean"
)

embeddings = F.normalize(embeddings, p=2, dim=1)
sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.9360030293464661, 0.8591322302818298, 0.728583037853241]
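
The last lines of the example above rely on a general property: once vectors are L2-normalized, their dot product equals their cosine similarity. The dependency-free sketch below checks this on toy 3-dimensional vectors that merely stand in for real embeddings. The helper names (`l2_normalize`, `dot`, `cosine`) are illustrative and not part of any library.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (the role F.normalize with p=2 plays above)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings.
u = [3.0, 4.0, 0.0]
v = [4.0, 3.0, 0.0]

score_via_normalize = dot(l2_normalize(u), l2_normalize(v))
score_via_cosine = cosine(u, v)
print(round(score_via_normalize, 6), round(score_via_cosine, 6))
# 0.96 0.96
```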
SentenceTransformers
from sentence_transformers import SentenceTransformer

inputs = [
    # first text of each pair
    "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
    "categorize_entailment: Женщину доставили в больницу, за ее жизнь сейчас борются врачи.",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # second text of each pair, in the same order
    "paraphrase: Ярославским баням разрешили работать без посетителей",
    "categorize_entailment: Женщину спасают врачи.",
    "search_document: Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование."
]

# loads model with CLS pooling
model = SentenceTransformer("ai-forever/FRIDA")

# embeddings are normalized by default
embeddings = model.encode(inputs, convert_to_tensor=True)

sim_scores = embeddings[:3] @ embeddings[3:].T
print(sim_scores.diag().tolist())
# [0.9360026717185974, 0.8591331243515015, 0.7285830974578857]

Alternatively, use prompts (requires sentence-transformers>=2.4.0):

from sentence_transformers import SentenceTransformer

# loads model with CLS pooling
model = SentenceTransformer("ai-forever/FRIDA")

paraphrase = model.encode(["В Ярославской области разрешили работу бань, но без посетителей", "Ярославским баням разрешили работать без посетителей"], prompt_name="paraphrase")
print(paraphrase[0] @ paraphrase[1].T) # 0.9360032

categorize_entailment = model.encode(["Женщину доставили в больницу, за ее жизнь сейчас борются врачи.", "Женщину спасают врачи."], prompt_name="categorize_entailment")
print(categorize_entailment[0] @ categorize_entailment[1].T) # 0.8591322

query_embedding = model.encode("Сколько программистов нужно, чтобы вкрутить лампочку?", prompt_name="search_query")
document_embedding = model.encode("Чтобы вкрутить лампочку, требуется три программиста: один напишет программу извлечения лампочки, другой — вкручивания лампочки, а третий проведет тестирование.", prompt_name="search_document")
print(query_embedding @ document_embedding.T) # 0.7285831
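
For retrieval, the same dot product can rank several documents against one query. The sketch below is a minimal illustration that uses toy, already-normalized 2-dimensional vectors in place of real `model.encode` output. `rank_documents` and `top_k` are hypothetical names introduced here.

```python
def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return (index, score) pairs for the top_k documents by dot product."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy unit vectors: one "search_query" embedding and three "search_document" embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]

ranked = rank_documents(query, docs)
print(ranked)
# [(1, 1.0), (2, 0.6)]
```

With real embeddings, the query would be encoded with the "search_query: " prefix and every document with the "search_document: " prefix, as in the examples above.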
Authors
Citation
@misc{TODO
}
Limitations

The model is designed to process Russian text; its quality on English text is unknown. The maximum input length is 512 tokens.
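
Because the usage examples pass `truncation=True` with `max_length=512`, anything beyond the limit is cut off. One possible workaround, sketched here under assumptions, is to split a long token sequence into windows and encode each window separately. The `chunk_token_ids` function and its `reserved` and `overlap` parameters are hypothetical. `reserved` stands for the tokens taken up by the task prefix and any special tokens, which must be re-added to every window. How to combine the per-window embeddings (for example by averaging) is not specified by the model card and is left open.

```python
def chunk_token_ids(token_ids, max_length=512, reserved=0, overlap=0):
    """Split token ids into windows of at most `max_length - reserved` tokens.

    Consecutive windows share `overlap` tokens so that text at a boundary
    appears in both neighbours.
    """
    window = max_length - reserved
    if window <= 0:
        raise ValueError("reserved must be smaller than max_length")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the text
    return chunks


# Integers stand in for real token ids produced by a tokenizer.
token_ids = list(range(1200))
chunks = chunk_token_ids(token_ids, max_length=512, reserved=12, overlap=50)
print([len(c) for c in chunks])
# [500, 500, 300]
```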


License

MIT: https://choosealicense.com/licenses/mit


FRIDA model page on huggingface.co:

https://huggingface.co/ai-forever/FRIDA


Provider of FRIDA huggingface.co

ai-forever