Model Details of biencoder-camembert-base-mmarcoFR
This is a dense single-vector bi-encoder model for French that can be used for semantic search. The model maps queries and passages to 768-dimensional dense vectors, which are used to compute relevance through cosine similarity.
Using Sentence-Transformers
Start by installing the sentence-transformers library:
pip install -U sentence-transformers
Then, you can use the model like this:
from sentence_transformers import SentenceTransformer
queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
model = SentenceTransformer('antoinelouis/biencoder-camembert-base-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
similarity = q_embeddings @ p_embeddings.T
print(similarity)
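The similarity matrix printed above has one row per query and one column per passage. For semantic search, the usual next step is to turn each row into a ranked list of passages. The following is a minimal sketch of that step, not part of the original model card: `rank_passages` is a hypothetical helper, and the toy 3-dimensional vectors stand in for the model's 768-dimensional embeddings so that the example runs without downloading the model.

```python
import numpy as np

def rank_passages(q_embeddings, p_embeddings, top_k=2):
    """Return, for each query, the indices and cosine scores of its top_k passages."""
    # L2-normalize so that the dot product equals the cosine similarity.
    q = q_embeddings / np.linalg.norm(q_embeddings, axis=1, keepdims=True)
    p = p_embeddings / np.linalg.norm(p_embeddings, axis=1, keepdims=True)
    scores = q @ p.T  # shape: (num_queries, num_passages)
    top_k = min(top_k, scores.shape[1])
    # Sort passage indices by descending score and keep the first top_k.
    order = np.argsort(-scores, axis=1)[:, :top_k]
    return order, np.take_along_axis(scores, order, axis=1)

# Toy vectors (assumption): one query, three passages.
q_demo = np.array([[1.0, 0.0, 0.0]])
p_demo = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
top_indices, top_scores = rank_passages(q_demo, p_demo, top_k=2)
print(top_indices)  # passage 1 ranks first, then passage 2
print(top_scores)
```

With real embeddings, `q_embeddings` and `p_embeddings` from the snippet above can be passed in directly. They are already normalized there, so the normalization inside the helper leaves them unchanged.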
Using FlagEmbedding
Start by installing the FlagEmbedding library:
pip install -U FlagEmbedding
Then, you can use the model like this:
from FlagEmbedding import FlagModel
queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
model = FlagModel('antoinelouis/biencoder-camembert-base-mmarcoFR')
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
similarity = q_embeddings @ p_embeddings.T
print(similarity)
Using Transformers
Start by installing the transformers library:
pip install -U transformers
Then, you can use the model like this:
import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    """Perform mean pooling on top of the contextualized word embeddings, ignoring padding tokens in the mean computation."""
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-camembert-base-mmarcoFR')
model = AutoModel.from_pretrained('antoinelouis/biencoder-camembert-base-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# Pool the token embeddings into one vector per text, then L2-normalize
q_embeddings = mean_pooling(q_output, q_input['attention_mask'])
q_embeddings = normalize(q_embeddings, p=2, dim=1)
p_embeddings = mean_pooling(p_output, p_input['attention_mask'])
p_embeddings = normalize(p_embeddings, p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
Evaluation
The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
To see how it compares to other neural retrievers in French, check out the DécouvrIR leaderboard.
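To make two of the reported metrics concrete, here is a small sketch of how MRR@k and R@k can be computed from ranked results. It is an illustration only, not the evaluation script behind the reported numbers. The function names, the toy run, the toy relevance judgments, and the cut-off of 3 are all assumptions.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k (0.0 if none)."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def evaluate(run, qrels, k=10):
    """Average both metrics over every query that has relevance judgments."""
    mrr = sum(mrr_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()) / len(qrels)
    rec = sum(recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()) / len(qrels)
    return {"mrr": mrr, "recall": rec}

# Toy data (hypothetical passage IDs): ranked results and relevant passages per query.
run = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9", "p4"]}
qrels = {"q1": {"p1"}, "q2": {"p4", "p5"}}
metrics = evaluate(run, qrels, k=3)
print(metrics)
```

In the toy run, the first relevant passage appears at rank 2 for `q1` and at rank 3 for `q2`, so MRR@3 is the mean of 1/2 and 1/3. Recall@3 is the mean of 1/1 and 1/2.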
Training
Data
We use the French training samples from the mMARCO dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official dataset but instead sample harder negatives mined from 12 distinct dense retrievers, using the msmarco-hard-negatives distillation dataset.
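The sketch below shows one way to pool and sample negatives mined by several retrievers while skipping BM25. The model card does not specify the exact sampling procedure, so this is an illustration under assumptions. The record layout (a `pos` list plus a `neg` mapping from retriever name to passage IDs), the retriever names, and the `sample_hard_negatives` helper are all hypothetical.

```python
import random

def sample_hard_negatives(record, num_negatives=1, excluded_systems=("bm25",), seed=0):
    """Pool negatives from all non-excluded retrievers and sample a few of them."""
    rng = random.Random(seed)
    positives = set(record["pos"])
    pool = set()
    for system, passage_ids in record["neg"].items():
        if system in excluded_systems:
            continue  # skip BM25 negatives, as described above
        # Never let a known positive slip into the negative pool.
        pool.update(pid for pid in passage_ids if pid not in positives)
    candidates = sorted(pool)  # sorted so that sampling is reproducible
    return rng.sample(candidates, min(num_negatives, len(candidates)))

# Toy record with hypothetical retriever names and passage IDs.
record = {
    "qid": 1,
    "pos": [10],
    "neg": {"bm25": [11, 12], "retriever_a": [13, 10, 14], "retriever_b": [14, 15]},
}
negatives = sample_hard_negatives(record, num_negatives=2)
print(negatives)
```

Pooling into a set before sampling means a passage returned by several retrievers is not sampled more often than one returned by a single retriever. Whether the original training pipeline weights negatives this way is not stated.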
Implementation
The model is initialized from the camembert-base checkpoint and optimized via the cross-entropy loss (as in DPR) with a temperature of 0.05. It is fine-tuned on one 32GB NVIDIA V100 GPU for 20 epochs (i.e., 65.7k steps) using the AdamW optimizer with a batch size of 152 and a peak learning rate of 2e-5, with warmup over the first 500 steps followed by linear scheduling. We set the maximum sequence length for both queries and passages to 128 tokens. We use cosine similarity to compute relevance scores.
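The objective and learning-rate schedule described above can be sketched as follows. This is a NumPy illustration under assumptions, not the actual training code. It assumes the DPR-style setup in which the positive passage for query i sits at row i of the passage matrix and every other row acts as a negative. The total step count of 65,700 is read off the approximate "65.7k" figure, and both function names are hypothetical.

```python
import numpy as np

def in_batch_contrastive_loss(q_emb, p_emb, temperature=0.05):
    """Cross-entropy over temperature-scaled cosine similarities.

    Row i of p_emb is the positive for query i. Any extra rows (for example,
    hard negatives appended after the positives) act as additional negatives.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = p_emb / np.linalg.norm(p_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])
    return float(-log_probs[idx, idx].mean())

def linear_warmup_lr(step, peak_lr=2e-5, warmup_steps=500, total_steps=65_700):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Toy check: perfectly aligned pairs give a near-zero loss, shuffled pairs do not.
aligned = in_batch_contrastive_loss(np.eye(3), np.eye(3))
shuffled = in_batch_contrastive_loss(np.eye(3), np.eye(3)[[1, 0, 2]])
print(aligned, shuffled)
print(linear_warmup_lr(250), linear_warmup_lr(500), linear_warmup_lr(65_700))
```

Because cosine similarities lie in [-1, 1], dividing by a temperature of 0.05 stretches the logits to [-20, 20], which sharpens the softmax and penalizes negatives that score close to the positive.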
Citation
@online{louis2024decouvrir,
    author    = {Antoine Louis},
    title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month     = {mar},
    year      = {2024},
    url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}