shibing624 / text2vec-bge-large-chinese

Model Details of text2vec-bge-large-chinese

This is a CoSENT (Cosine Sentence) model: shibing624/text2vec-bge-large-chinese.

It maps sentences to a 1024-dimensional dense vector space and can be used for tasks like sentence embeddings, text matching, or semantic search.

Evaluation

For an automated evaluation of this model, see the Evaluation Benchmark: text2vec.
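The benchmark scores below are Spearman correlations between the model's cosine similarities and human similarity labels. A minimal sketch of how such a score is computed, using scipy and sentence-transformers (the pairs and gold scores here are toy data for illustration, not taken from the benchmark):

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Toy STS-style pairs with human similarity labels (illustrative only)
pairs = [('如何更换花呗绑定银行卡', '花呗更改绑定银行卡'),
         ('花呗更改绑定银行卡', '我什么时候开通了花呗'),
         ('今天天气怎么样', '如何更换花呗绑定银行卡')]
gold = [1.0, 0.5, 0.0]

m = SentenceTransformer('shibing624/text2vec-bge-large-chinese')
preds = [util.cos_sim(m.encode(a), m.encode(b)).item() for a, b in pairs]

# Spearman compares the rank order of predictions against the gold labels
corr, _ = spearmanr(preds, gold)
print(f"Spearman correlation: {corr:.4f}")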

Release Models
  • Chinese matching evaluation results for the models released by this project:
Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS
Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769
SBERT | xlm-roberta-base | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138
CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008
CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092
CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-sentence | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089
CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066
CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 3138
CoSENT | BAAI/bge-large-zh-noinstruct | shibing624/text2vec-bge-large-chinese | 38.41 | 61.34 | 71.72 | 35.15 | 76.44 | 71.81 | 63.15 | 59.72 | 844

Notes:

  • Evaluation metric: Spearman correlation coefficient.
  • shibing624/text2vec-base-chinese was trained with the CoSENT method on the Chinese STS-B data, based on hfl/chinese-macbert-base, and scores well on the Chinese STS-B test set. Run examples/training_sup_text_matching_model.py to train it; the model files have been uploaded to the HF model hub. Recommended for general Chinese semantic matching tasks.
  • shibing624/text2vec-base-chinese-sentence was trained with the CoSENT method, based on nghuyong/ernie-3.0-base-zh, on the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, and scores well on the Chinese NLI test sets. Run examples/training_sup_text_matching_model_jsonl_data.py to train it; the model files have been uploaded to the HF model hub. Recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
  • shibing624/text2vec-base-chinese-paraphrase was trained with the CoSENT method, based on nghuyong/ernie-3.0-base-zh, on the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset, which adds s2p (sentence-to-paraphrase) data relative to shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset and strengthens long-text representation. It reaches SOTA on the Chinese NLI test sets. Run examples/training_sup_text_matching_model_jsonl_data.py to train it; the model files have been uploaded to the HF model hub. Recommended for Chinese s2p (sentence vs. paragraph) semantic matching tasks.
  • shibing624/text2vec-base-multilingual was trained with the CoSENT method, based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, on the manually curated multilingual STS dataset shibing624/nli-zh-all/text2vec-base-multilingual-dataset, and improves over the base model on Chinese and English test sets. Run examples/training_sup_text_matching_model_jsonl_data.py to train it; the model files have been uploaded to the HF model hub. Recommended for multilingual semantic matching tasks.
  • shibing624/text2vec-bge-large-chinese was trained with the CoSENT method, based on BAAI/bge-large-zh-noinstruct, on the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset, and improves over the base model on Chinese test sets, with a clear gain in short-text discrimination. Run examples/training_sup_text_matching_model_jsonl_data.py to train it; the model files have been uploaded to the HF model hub. Recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
  • w2v-light-tencent-chinese is a Word2Vec model of the Tencent word vectors; it loads on CPU and suits literal Chinese text matching and cold-start scenarios with little data.
  • Each pretrained base model can be called through transformers, e.g. the MacBERT model with --model_name hfl/chinese-macbert-base or a RoBERTa model with --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall.
  • To test robustness, the unseen SOHU test sets were added to measure generalization; to make the models usable out of the box, all collected Chinese matching datasets were used, and the datasets have also been uploaded to HF datasets (links below).
  • Experiments on Chinese matching tasks show the best pooling is EncoderType.FIRST_LAST_AVG or EncoderType.MEAN, with only a tiny difference in prediction quality between the two (a short sketch follows this list).
  • To reproduce the Chinese matching evaluation results, download the Chinese matching datasets to examples/data and run tests/model_spearman.py.
  • QPS was measured on a Tesla V100 GPU with 32 GB of memory.
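A minimal sketch of the two pooling choices via text2vec. This assumes the text2vec package exports EncoderType and that SentenceModel accepts an encoder_type argument, as in recent releases; check your installed version:

from text2vec import SentenceModel, EncoderType

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# FIRST_LAST_AVG: average the first and last hidden layers, then mean-pool over tokens
model_fla = SentenceModel('shibing624/text2vec-bge-large-chinese',
                          encoder_type=EncoderType.FIRST_LAST_AVG)
# MEAN: mean-pool the last hidden layer over tokens
model_mean = SentenceModel('shibing624/text2vec-bge-large-chinese',
                           encoder_type=EncoderType.MEAN)

print(model_fla.encode(sentences).shape)   # (2, 1024)
print(model_mean.encode(sentences).shape)  # (2, 1024)

Per the note above, the two pooling modes score almost identically, so the default is fine for most uses.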

Model training experiment report: experiment report

Usage (text2vec)

Using this model becomes easy when you have text2vec installed:

pip install -U text2vec

Then you can use the model like this:

from text2vec import SentenceModel
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

model = SentenceModel('shibing624/text2vec-bge-large-chinese')
embeddings = model.encode(sentences)
print(embeddings)
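The two example sentences are paraphrases, so their embeddings should score high on cosine similarity. A minimal follow-up sketch with numpy (assuming encode returns a numpy array, as text2vec does by default):

import numpy as np

# Cosine similarity between the two sentence embeddings
a, b = embeddings[0], embeddings[1]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))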
Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, you pass your input through the transformer model; then you apply the right pooling operation on top of the contextualized word embeddings.

Install transformers:

pip install transformers

Then load model and predict:

from transformers import BertTokenizer, BertModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-bge-large-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-bge-large-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
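To compare the two pooled embeddings, L2-normalize them and take a dot product; a small sketch continuing the block above (torch.nn.functional is standard PyTorch):

import torch.nn.functional as F

# After L2 normalization, the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print("cosine similarity:", (normalized[0] @ normalized[1]).item())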
Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load model and predict:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
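sentence-transformers also ships a cosine-similarity helper (sentence_transformers.util.cos_sim), so the pairwise score takes one line:

from sentence_transformers import util

# 2x2 matrix of pairwise cosine similarities between the embeddings
print(util.cos_sim(sentence_embeddings, sentence_embeddings))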
Full Model Architecture
CoSENT(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_mean_tokens': True})
)
Intended uses

Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.

By default, input text longer than 256 word pieces is truncated.
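When the model is loaded through sentence-transformers, this limit is exposed as max_seq_length and can be inspected or changed (a hedged sketch; the model was trained at 256 tokens, so quality on much longer inputs is not guaranteed):

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
print(m.max_seq_length)  # 256 by default
m.max_seq_length = 512   # accept longer inputs, at higher memory cost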

Training procedure
Pre-training

We use the pretrained BAAI/bge-large-zh-noinstruct model (https://huggingface.co/BAAI/bge-large-zh-noinstruct). Please refer to the model card for more detailed information about the pre-training procedure.

Fine-tuning

We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch, then apply a rank loss that pushes the similarities of true pairs above those of false pairs.
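The rank loss used by CoSENT can be written as log(1 + Σ exp(λ·(cos(neg) − cos(pos)))) over all (positive, negative) pair combinations. A minimal PyTorch sketch under that reading (the function name and the scale λ = 20 are our choices; the reference implementation lives in the text2vec repo):

import torch

def cosent_loss(cos_sims: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # cos_sims: (N,) cosine similarity of each sentence pair
    # labels:   (N,) 1 for a true (similar) pair, 0 for a false pair
    s = cos_sims * scale
    # diff[i, j] = s[j] - s[i]; we want s[negative] - s[positive]
    diff = s[None, :] - s[:, None]
    # keep only entries where pair i is positive and pair j is negative
    mask = labels[:, None] > labels[None, :]
    diff = diff[mask]
    # log(1 + sum(exp(diff))) == logsumexp over diff with a prepended 0
    zero = torch.zeros(1, device=diff.device)
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)

Labels only need to encode a relative order: any pair labeled 1 is pushed to score above any pair labeled 0, which is why graded STS scores also work with this loss.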

Citing & Authors

This model was trained by text2vec.

If you find this model helpful, feel free to cite:

@software{text2vec,
  author = {Ming Xu},
  title = {text2vec: A Tool for Text to Vector},
  year = {2023},
  url = {https://github.com/shibing624/text2vec},
}

License

This model is released under the Apache-2.0 license: https://choosealicense.com/licenses/apache-2.0

Model page: https://huggingface.co/shibing624/text2vec-bge-large-chinese
