allenai / aspire-sentence-embedder

Last updated: October 3, 2022
Pipeline tag: feature-extraction

Model Details


language: en

license: apache-2.0


Overview

Model included in a paper for modeling fine-grained similarity between documents:

Title: "Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity"

Authors: Sheshera Mysore, Arman Cohan, Tom Hope

Paper: https://arxiv.org/abs/2111.08366

GitHub: https://github.com/allenai/aspire

Note: In the context of the paper, this model is referred to as cosentbert and represents a baseline sentence encoder for scientific text. The paper trains two versions of cosentbert, one for biomedical scientific text and one for computer science text. This released model is trained on a union of all available data across scientific domains in the Semantic Scholar Open Research Corpus (S2ORC) dataset. This difference in training data leads to evaluation performance that is close to, but not identical with, the numbers reported in the paper.

Model Card

Model description: This model is a SciBERT-based sentence encoder pre-trained for scientific text similarity. It represents a sentence with a single vector: the representation of the CLS token.

Training data: The model is trained in a contrastive learning setup on pairs of co-citation context sentences that reference the same set of papers. Such sentences can often be considered paraphrases, since co-citation sentences citing the same papers tend to describe similar aspects of the co-cited papers. The model is trained on 4.3 million sentence pairs of this type. During training, negative examples for the contrastive loss are obtained as random in-batch negatives. An example pair of training sentences is:

"The idea of distant supervision has been proposed and used widely in Relation Extraction (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) , where the source of labels is an external knowledge base."

"Distant supervision [31, 43, 21, 49] generates training data automatically by aligning texts and a knowledge base (KB) (see Fig. 1 )."

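To make this setup concrete, here is a minimal sketch of a contrastive loss with random in-batch negatives over CLS embeddings. The dot-product similarity, the temperature, and the cross-entropy form are assumptions for illustration; the card does not specify these details of the released training code.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_cls: torch.Tensor,
                              positive_cls: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    # anchor_cls, positive_cls: (batch, hidden) CLS embeddings of paired
    # co-citation sentences; row i of each tensor is a positive pair, and
    # every other row in the batch serves as a random in-batch negative.
    sims = anchor_cls @ positive_cls.T / temperature  # (batch, batch) scores
    targets = torch.arange(sims.size(0), device=sims.device)
    # Diagonal entries are the positives the model should rank highest.
    return F.cross_entropy(sims, targets)
```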
Training procedure: The model was trained with the Adam optimizer and a learning rate of 2e-5, with 1,000 warm-up steps followed by linear decay of the learning rate. Training convergence is checked with the loss on a held-out dev set of co-citation context pairs. All training data was in English.
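As a sketch of this optimizer and schedule configuration using the transformers scheduler helper (the total number of training steps is an assumption; the card does not report it):

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained('allenai/aspire-sentence-embedder')
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
# 1,000 warm-up steps followed by linear decay, as described above; the
# 100_000 total steps here is illustrative only.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=100_000)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```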

Intended uses & limitations: This model is trained for sentence similarity tasks on scientific text and is best used as a sentence encoder. With appropriate fine-tuning, however, it can also be used for other tasks such as classification. Note that about 50% of the training data is biomedical text, so performance may be superior on biomedical and similar domains.

How to use: This model can be used as a BERT model via the transformers library:


```python
from transformers import AutoModel, AutoTokenizer

aspire_sent = AutoModel.from_pretrained('allenai/aspire-sentence-embedder')
aspire_tok = AutoTokenizer.from_pretrained('allenai/aspire-sentence-embedder')

s = 'We present a new scientific document similarity model based on matching fine-grained aspects of texts.'
inputs = aspire_tok(s, padding=True, truncation=True, return_tensors="pt", max_length=512)
result = aspire_sent(**inputs)
# The sentence embedding is the representation of the CLS token.
clsrep = result.last_hidden_state[:, 0, :]
```
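Building on the snippet above, an illustrative way to score two sentences with these CLS embeddings, using the L2 distance that the retrieval setup below relies on (the example sentences are made up):

```python
import torch

sents = ['Distant supervision generates training data by aligning text with a knowledge base.',
         'The idea of distant supervision has been widely used in relation extraction.']
batch = aspire_tok(sents, padding=True, truncation=True, return_tensors="pt", max_length=512)
with torch.no_grad():
    cls = aspire_sent(**batch).last_hidden_state[:, 0, :]  # (2, hidden)
l2 = torch.dist(cls[0], cls[1]).item()  # smaller distance = more similar
```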

Or via the sentence_transformers library:


```python
from sentence_transformers import SentenceTransformer, models

# Wrap the HF model as a sentence_transformers model with CLS pooling.
word_embedding_model = models.Transformer('allenai/aspire-sentence-embedder', max_seq_length=512)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='cls')
aspire_sb = SentenceTransformer(modules=[word_embedding_model, pooling_model])

clsrep_sb = aspire_sb.encode([s])
```

Variables and metrics:

Since the paper this model was trained for proposes methods for computing the similarity of scientific abstracts, the model is evaluated on information retrieval datasets with document-level queries. The datasets used in the paper are RELISH (biomedical/English), TRECCOVID (biomedical/English), and CSFCube (computer science/English); all are detailed on GitHub and in our paper. RELISH and TRECCOVID represent an abstract-level retrieval task: given a query scientific abstract, the task requires retrieving relevant candidate abstracts. CSFCube presents a slightly different task, providing a set of finer-grained sentences in the abstract on the basis of which a finer-grained retrieval must be made. This is the closest task to sentence similarity.

To use this sentence-level model for abstract-level retrieval, we rank documents by the minimal L2 distance between the sentences in the query abstract and the sentences in the candidate abstract.
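A minimal sketch of this ranking rule, assuming CLS sentence embeddings have already been computed for the query and for each candidate abstract (the function names are illustrative, not from the released code):

```python
import torch

def min_l2_score(query_sents: torch.Tensor, cand_sents: torch.Tensor) -> float:
    # query_sents: (nq, d), cand_sents: (nc, d) sentence embeddings.
    dists = torch.cdist(query_sents, cand_sents, p=2)  # (nq, nc) pairwise L2
    return dists.min().item()  # minimal sentence-pair distance

def rank_candidates(query_sents: torch.Tensor, candidates: dict) -> list:
    # candidates: {doc_id: (nc, d) tensor}; smaller distances rank higher.
    scores = {cid: min_l2_score(query_sents, sents)
              for cid, sents in candidates.items()}
    return sorted(scores, key=scores.get)
```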

Evaluation results:

The released aspire-sentence-embedder model is compared against 1) all-mpnet-base-v2, a sentence-bert model trained on ~1 billion training examples, 2) paraphrase-TinyBERT-L6-v2, a sentence-bert model trained on paraphrase pairs, and 3) the cosentbert models used in our paper.

| Model | CSFCube aggregated MAP | CSFCube aggregated NDCG%20 | TRECCOVID MAP | TRECCOVID NDCG%20 | RELISH MAP | RELISH NDCG%20 |
|------------------------------:|:-----:|:-------:|:-----:|:-------:|:-----:|:-------:|
| all-mpnet-base-v2             | 34.64 | 54.94   | 17.35 | 43.87   | 52.92 | 69.69   |
| paraphrase-TinyBERT-L6-v2     | 26.77 | 48.57   | 11.12 | 34.85   | 50.80 | 67.35   |
| cosentbert                    | 28.95 | 50.68   | 12.80 | 38.07   | 50.04 | 66.35   |
| aspire-sentence-embedder      | 30.58 | 53.86   | 11.64 | 36.50   | 50.36 | 66.63   |

Across datasets, the released model performs similarly to the per-domain cosentbert models used in our paper (reported above).
