The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently come in three sizes: GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which enables them to be applied to various downstream text embedding tasks, including information retrieval, semantic textual similarity, and text reranking.
Metrics
We compared the performance of the GTE models with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the MTEB leaderboard.
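The reported numbers can be reproduced with the mteb Python package, which wraps the benchmark tasks and accepts any sentence-transformers model. The single task below is only an illustration (the full benchmark covers many more tasks); a minimal sketch, assuming the mteb package is installed:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thenlper/gte-large")
# Evaluate on one STS task as an example; add more task names to broaden the run
evaluation = MTEB(tasks=["STSBenchmark"])
evaluation.run(model, output_folder="results/gte-large")

Use with transformers: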
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average the remaining token embeddings
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
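Because the embeddings are L2-normalized above, the matrix product is equivalent to cosine similarity, so the resulting scores directly rank the candidate passages against the first (query) text.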
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['That is a happy person', 'That is a very happy person']
model = SentenceTransformer('thenlper/gte-large')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
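For the retrieval and reranking scenarios mentioned above, the same embeddings can be used with sentence-transformers' semantic_search utility. The corpus and query below are illustrative placeholders, not part of the original model card; a minimal sketch:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('thenlper/gte-large')

# Illustrative corpus and query (placeholders)
corpus = [
    "Beijing is the capital of China.",
    "Quicksort is a divide-and-conquer sorting algorithm.",
    "The Great Wall is located in northern China."
]
query = "what is the capital of China?"

# Encode corpus and query into dense embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus passages by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 4))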
Limitation
This model is intended for English texts only, and any text longer than 512 tokens will be truncated to that maximum length.
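If it matters whether an input will be affected by this limit, the tokenizer can be used to count tokens before encoding. This check is only an illustration and not part of the original model card; a minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
long_text = "..."  # placeholder for a document that may exceed 512 tokens

# Count tokens (including special tokens) to see whether truncation will occur
n_tokens = len(tokenizer(long_text)["input_ids"])
if n_tokens > 512:
    print(f"Input has {n_tokens} tokens; tokens beyond 512 will be dropped.")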
Citation
If you find our paper or models helpful, please consider citing them as follows:
@article{li2023towards,
title={Towards general text embeddings with multi-stage contrastive learning},
author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
journal={arXiv preprint arXiv:2308.03281},
year={2023}
}