This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search.
Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer, util
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]
# Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot scores between the query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
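Under the hood, `util.dot_score` reduces to a plain dot product between the query embedding and each document embedding. A dependency-free sketch of that computation (with tiny made-up vectors standing in for the real 768-dimensional embeddings):

```python
def dot_score(query_emb, doc_embs):
    # Dot product of the query vector with each document vector.
    return [sum(q * d for q, d in zip(query_emb, doc)) for doc in doc_embs]

# Toy 3-dimensional stand-ins for the model's 768-dimensional embeddings.
query = [0.2, 0.5, 0.1]
docs = [[0.1, 0.6, 0.0],   # similar direction: higher score
        [0.9, -0.2, 0.3]]  # different direction: lower score
scores = dot_score(query, docs)
print(scores)
```

Higher scores indicate higher relevance, which is why the example above sorts the (doc, score) pairs in decreasing order.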
Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
# CLS pooling: take the output of the first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings
# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")
# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot scores between the query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
Technical Details
The following technical details describe how this model should be used:

| Setting | Value |
| --- | --- |
| Dimensions | 768 |
| Produces normalized embeddings | No |
| Pooling method | CLS pooling |
| Suitable score functions | dot-product (e.g. util.dot_score) |
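Because the embeddings are not normalized, dot-product and cosine similarity can rank documents differently: the dot product also rewards vector magnitude. A small sketch of the difference (toy 2-dimensional vectors, not real model output):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# b points in almost the same direction as the query but is short;
# c is less aligned but much longer.
query = [1.0, 0.0]
b = [0.9, 0.1]
c = [3.0, 2.0]
print(cosine(query, b) > cosine(query, c))  # True: b is better aligned
print(dot(query, b) > dot(query, c))        # False: c wins on magnitude
```

This is why the model card recommends dot-product scoring: the model was trained with it, and its un-normalized embedding lengths carry information that cosine similarity discards.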
Background
The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
Our model is intended to be used for semantic search: it encodes queries/questions and text paragraphs in a dense vector space, and finds relevant passages for a given query.
Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was trained only on input text of up to 250 word pieces; it might not work well for longer text.
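The truncation behavior can be sketched in plain Python (a hypothetical list of word-piece IDs stands in for the tokenizer's real output; in practice `truncation=True` in the tokenizer call handles this automatically):

```python
MAX_SEQ_LEN = 512

def truncate(word_pieces, max_len=MAX_SEQ_LEN):
    # Anything beyond max_len word pieces is silently dropped.
    return word_pieces[:max_len]

long_text = list(range(800))   # stands in for 800 word pieces
truncated = truncate(long_text)
print(len(truncated))          # 512: the last 288 pieces never reach the model
```

Any information in the truncated tail cannot influence the embedding, so long documents are usually split into shorter passages before encoding.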
Training procedure
The full training script is accessible in this repository: train_script.py.
Pre-training
We use the pretrained mpnet-base model. Please refer to its model card for more detailed information about the pre-training procedure.
Training
We use a concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs. Each dataset was sampled with a weighted probability; the configuration is detailed in the data_config.json file.
The model was trained with MultipleNegativesRankingLoss using CLS pooling, dot-product as the similarity function, and a scale of 1.
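As a rough illustration of this loss (a simplified pure-Python sketch, not the sentence-transformers implementation): for a batch of (question, answer) embedding pairs, every answer in the batch serves as a negative for every question except its own, and the loss is the cross-entropy over the scaled dot-product scores:

```python
import math

def mnrl_loss(query_embs, doc_embs, scale=1.0):
    """Simplified MultipleNegativesRankingLoss: in-batch negatives,
    dot-product similarity, cross-entropy over each row of scores."""
    n = len(query_embs)
    total = 0.0
    for i, q in enumerate(query_embs):
        # Dot-product score of query i against every document in the batch.
        scores = [scale * sum(qk * dk for qk, dk in zip(q, d)) for d in doc_embs]
        # Cross-entropy: the paired document (index i) is the positive.
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[i]
    return total / n

# Toy batch: each query's paired document points in the same direction.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs    = [[1.0, 0.0], [0.0, 1.0]]
print(mnrl_loss(queries, docs))  # low loss: positives outscore in-batch negatives
```

The scale multiplies the scores before the softmax; with dot-product similarity and un-normalized embeddings, a scale of 1 (as used here) is the common choice.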
| Dataset | Description | Number of training tuples |
| --- | --- | --- |
| WikiAnswers | Duplicate question pairs from WikiAnswers | 77,427,422 |
| PAQ | Automatically generated (question, paragraph) pairs for each paragraph in Wikipedia | |