nomic-embed-text-v1 replicate.com api & lucataco nomic-embed-text-v1 github AI Model

Introduction of nomic-embed-text-v1

Model Details of nomic-embed-text-v1

Readme

nomic-embed-text-v1: A Reproducible Long Context (8192) Text Embedder

nomic-embed-text-v1 is a text encoder with an 8192-token context length that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.

Name	SeqLen	MTEB	LoCo	Jina Long Context	Open Weights	Open Training Code	Open Data
nomic-embed-text-v1	8192	62.39	85.53	54.16	✅	✅	✅
jina-embeddings-v2-base-en	8192	60.39	85.45	51.90	✅	❌	❌
text-embedding-3-small	8191	62.26	82.40	58.20	❌	❌	❌
text-embedding-ada-002	8191	60.99	52.7	55.25	❌	❌	❌

Hosted Inference API

The easiest way to get started with Nomic Embed is through the Nomic Embedding API.

Generating embeddings with the nomic Python client takes a single call:

from nomic import embed

output = embed.text(
    texts=['Nomic Embedding API', '#keepAIOpen'],
    model='nomic-embed-text-v1',
    task_type='search_document'
)

print(output)

For more information, see the API reference.

Data Visualization

A 5M sample of our contrastive pretraining data can be explored as a Nomic Atlas map.

Training Details

We train our embedder using a multi-stage training pipeline. Starting from a long-context BERT model, the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
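As a rough illustration of what a contrastive stage over text pairs optimizes, the sketch below computes an InfoNCE-style loss with in-batch negatives in plain Python. The choice of loss, the temperature value, and the toy vectors are assumptions made for illustration; the paragraph above does not specify the exact objective, so the technical report remains the reference for the actual training setup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where documents[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in documents]
        # Numerically stable log-sum-exp over the whole batch
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-d "embeddings" (made up): in `aligned`, pair i points the same way
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each text sits closest to its own pair and large when pairs are mismatched, which is the signal that pulls related texts together during training.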

In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.

For more details, see the Nomic Embed Technical Report and corresponding blog post.

The data used to train the models is released in its entirety. For more details, see the contrastors repository.

Usage

Note: nomic-embed-text requires task prefixes! We support the prefixes [search_query, search_document, classification, clustering]. Each prefix is followed by a colon and a space, as in 'search_query: What is TSNE?'. For retrieval applications, you should prepend search_document to all your documents and search_query to your queries.
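A minimal sketch of applying these prefixes before encoding is shown here. The helper `add_prefix` is a hypothetical name introduced for illustration, not part of any Nomic library; the 'prefix: text' format follows the usage examples in this section.

```python
# The four task prefixes listed above
TASK_PREFIXES = ("search_query", "search_document", "classification", "clustering")

def add_prefix(texts, task):
    """Return the texts with '<task>: ' prepended (hypothetical helper)."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    return [f"{task}: {text}" for text in texts]

# Documents and queries get different prefixes in retrieval applications
documents = add_prefix(["TSNE is a dimensionality reduction technique."], "search_document")
queries = add_prefix(["What is TSNE?"], "search_query")
print(documents)
print(queries)
```

The prefixed strings can then be passed unchanged to whichever client is used to compute embeddings.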

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)
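For retrieval, embeddings like these are typically compared with cosine similarity. The sketch below ranks documents against a query using plain Python lists; the three-dimensional vectors are made-up stand-ins for real model output, and `rank_documents` is a hypothetical helper rather than part of sentence-transformers.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_documents(query_embedding, document_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_embedding, d))
              for i, d in enumerate(document_embeddings)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up vectors standing in for the output of an embedding model
query_embedding = [0.9, 0.1, 0.0]
document_embeddings = [
    [0.0, 1.0, 0.0],  # points away from the query
    [1.0, 0.0, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query_embedding, document_embeddings)
print([index for index, _ in ranking])
```

In a real pipeline the query vector would come from a text with the search_query prefix and the document vectors from texts with the search_document prefix.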

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True)
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)
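To make the mean_pooling step concrete, here is a dependency-free sketch of the same computation on nested lists. Token vectors are averaged only over positions where the attention mask is 1, so padding does not affect the result, and the pooled vector is then L2-normalized. The toy values are illustrative assumptions, and `masked_mean` and `l2_normalize` are hypothetical names.

```python
import math

def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over positions whose mask value is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vector, mask in zip(token_embeddings, attention_mask):
        for j in range(dim):
            total[j] += vector[j] * mask
    # Same guard against division by zero as torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [value / count for value in total]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# One sequence: two real tokens followed by one padding token
token_embeddings = [[1.0, 3.0], [3.0, 1.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_embeddings, attention_mask)
normalized = l2_normalize(pooled)
print(pooled)
print(normalized)
```

The padding vector is deliberately large to show that masked positions contribute nothing to the pooled embedding.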

The model natively supports scaling of the sequence length past 2048 tokens. To do so, change the tokenizer and model loading as follows:

- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+ tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)


- model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True)
+ model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True, rotary_scaling_factor=2)

Transformers.js

import { pipeline } from '@xenova/transformers';

// Create a feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'nomic-ai/nomic-embed-text-v1', {
    quantized: false, // Comment out this line to use the quantized version
});

// Compute sentence embeddings
const texts = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?'];
const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
console.log(embeddings);

Citation

If you find the model, dataset, or training code useful, please cite our work

@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder}, 
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Pricing of nomic-embed-text-v1 replicate.com

Run time and cost

This model costs approximately $0.00025 per run on Replicate (about 4000 runs per $1), though this varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 2 seconds.
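The arithmetic behind these figures can be sketched as follows. The per-run price is the approximate value quoted above and varies with inputs, so the results are estimates only; the function names are hypothetical.

```python
from decimal import Decimal

# Approximate USD per run, taken from the figure quoted above
COST_PER_RUN = Decimal("0.00025")

def runs_per_dollar(cost_per_run=COST_PER_RUN):
    """Number of whole runs that one US dollar buys at the given price."""
    return int(Decimal("1") / cost_per_run)

def estimated_cost(num_runs, cost_per_run=COST_PER_RUN):
    """Approximate cost in USD for num_runs predictions."""
    return cost_per_run * num_runs

print(runs_per_dollar())
print(estimated_cost(10_000))
```

Decimal is used instead of float so that small per-run prices multiply without binary rounding error.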

Runs of lucataco nomic-embed-text-v1 on replicate.com

Total runs: 3.6K

More Information About nomic-embed-text-v1 replicate.com Model

nomic-embed-text-v1 is released under the Apache-2.0 license. More models with this license are listed here:

https://huggingface.co/models?license=license%3Aapache-2.0

nomic-embed-text-v1 replicate.com

lucataco/nomic-embed-text-v1 is a hosted version of nomic-embed-text-v1 on replicate.com that can be used immediately, without any local setup. replicate.com supports a free trial of the model as well as paid use, and the model can be called through the API from Node.js, Python, or plain HTTP.

nomic-embed-text-v1 replicate.com Url

https://replicate.com/lucataco/nomic-embed-text-v1

lucataco nomic-embed-text-v1 online free

replicate.com hosts nomic-embed-text-v1 for online trial and API access. You can try the model online for free at the link below.

lucataco nomic-embed-text-v1 online free url in replicate.com:

https://replicate.com/lucataco/nomic-embed-text-v1

nomic-embed-text-v1 install

nomic-embed-text-v1 is open source, and the code for this deployment is available on GitHub for anyone to install and run locally. Alternatively, replicate.com hosts a ready-to-use installation that can be used directly for testing and debugging, either in the browser or through the API.

nomic-embed-text-v1 install url in replicate.com:

https://replicate.com/lucataco/nomic-embed-text-v1

nomic-embed-text-v1 install url in github:

https://github.com/lucataco/cog-nomic-embed-text-v1
