mila-intel / ProtST-esm1b

huggingface.co
Total runs: 243
24-hour runs: 2
7-day runs: 197
30-day runs: 133
Model Last Updated: April 12, 2024
Task: feature-extraction

Introduction to ProtST-esm1b

Model Details of ProtST-esm1b

Abstract

Current protein language models (PLMs) learn protein representations mainly from their sequences, thereby capturing co-evolutionary information well, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, textual property descriptions are available, in which their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enrich a PLM with protein property information at different granularities while preserving the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation. Source code and model weights are available at https://github.com/DeepGraphLearning/ProtST.
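
For intuition, the multimodal representation alignment task can be viewed as a CLIP-style contrastive objective between paired protein and text embeddings. The sketch below is an illustrative reconstruction, not the authors' exact implementation; the names protein_feature, text_feature and logit_scale are borrowed from the inference script later in this card.

import torch
import torch.nn.functional as F

def alignment_loss(protein_feature, text_feature, logit_scale):
    # Illustrative InfoNCE-style alignment between paired protein/text embeddings.
    # protein_feature, text_feature: [batch, dim]; logit_scale: scalar temperature (already exponentiated).
    protein_feature = protein_feature / protein_feature.norm(dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(dim=-1, keepdim=True)
    logits = logit_scale * protein_feature @ text_feature.t()    # [batch, batch] similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2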


Performance

Figure: Zero-shot ProtST-ESM-1b outperforms few-shot classifiers.

Example

The following script shows how to run ProtST for zero-shot classification on Intel Gaudi (HPU). It is presented as a diff against the plain CPU version: lines prefixed with + are the Gaudi-specific additions, and lines prefixed with - are the original lines they replace. Running it requires the Habana PyTorch packages that provide habana_frameworks.torch.

import logging
import functools
from tqdm import tqdm
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer, AutoConfig
+ import habana_frameworks.torch

logger = logging.getLogger(__name__)


def tokenize_protein(example, protein_tokenizer=None, padding=None):
    protein_seqs = example["prot_seq"]

-   protein_inputs = protein_tokenizer(protein_seqs, padding=padding, add_special_tokens=True)
+   # pad/truncate every sequence to a fixed length of 1024 so input shapes stay static on HPU
+   protein_inputs = protein_tokenizer(protein_seqs, padding="max_length", truncation=True, add_special_tokens=True, max_length=1024)
    example["protein_input_ids"] = protein_inputs.input_ids
    example["protein_attention_mask"] = protein_inputs.attention_mask
    
    return example


def label_embedding(labels, text_tokenizer, text_model, device):
    # embed label descriptions
    label_feature = []
    with torch.inference_mode():
        for label in labels:
            label_input_ids = text_tokenizer.encode(label, max_length=128,
-                                                   truncation=True, add_special_tokens=False)
+                                                   truncation=True, add_special_tokens=False, padding="max_length")
            # prepend [CLS] manually because encode() is called with add_special_tokens=False
            label_input_ids = [text_tokenizer.cls_token_id] + label_input_ids
            label_input_ids = torch.tensor(label_input_ids, dtype=torch.long, device=device).unsqueeze(0)
            attention_mask = label_input_ids != text_tokenizer.pad_token_id
            attention_mask = attention_mask.to(device)

            text_outputs = text_model(label_input_ids, attention_mask=attention_mask)

-           label_feature.append(text_outputs["text_feature"])
+           # clone the output tensor; HPU graph outputs may be reused on the next call
+           label_feature.append(text_outputs["text_feature"].clone())
    label_feature = torch.cat(label_feature, dim=0)
    label_feature = label_feature / label_feature.norm(dim=-1, keepdim=True)

    return label_feature

def zero_shot_eval(logger, device, test_dataset, target_field,
                   protein_model, logit_scale, label_feature):

    # get prediction and target
    test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False)
    preds, targets = [], []
    with torch.inference_mode():
        for data in tqdm(test_dataloader):
            target = data[target_field]
            targets.append(target)

            protein_input_ids = torch.tensor(data["protein_input_ids"], dtype=torch.long, device=device).unsqueeze(0)
            attention_mask = torch.tensor(data["protein_attention_mask"], dtype=torch.long, device=device).unsqueeze(0)
            
            protein_outputs = protein_model(protein_input_ids, attention_mask=attention_mask)

            protein_feature = protein_outputs["protein_feature"]
            protein_feature = protein_feature / protein_feature.norm(dim=-1, keepdim=True)
            pred = logit_scale * protein_feature @ label_feature.t()
            preds.append(pred)

    preds = torch.cat(preds, dim=0)
    targets = torch.tensor(targets, dtype=torch.long, device=device)
    accuracy = (preds.argmax(dim=-1) == targets).float().mean().item()
    logger.warning("Zero-shot accuracy: %.6f" % accuracy)


if __name__ == "__main__":
    # get datasets
    raw_datasets = load_dataset("mila-intel/ProtST-SubcellularLocalization", cache_dir="~/.cache/huggingface/datasets", split='test') # cache_dir defaults to "~/.cache/huggingface/datasets"

-   device = torch.device("cpu")
+   device = torch.device("hpu")  # run on the Gaudi accelerator instead of CPU
    
    protst_model = AutoModel.from_pretrained("mila-intel/ProtST-esm1b", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
    protein_model = protst_model.protein_model
    text_model = protst_model.text_model
    logit_scale = protst_model.logit_scale
+   # wrap both encoders in HPU graphs to reduce host-side launch overhead
+   from habana_frameworks.torch.hpu import wrap_in_hpu_graph
+   protein_model = wrap_in_hpu_graph(protein_model)
+   text_model = wrap_in_hpu_graph(text_model)
    logit_scale.requires_grad = False
    logit_scale = logit_scale.to(device)
    logit_scale = logit_scale.exp()

    protein_tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
    text_tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

    func_tokenize_protein = functools.partial(tokenize_protein, protein_tokenizer=protein_tokenizer, padding=False)
    test_dataset = raw_datasets.map(
            func_tokenize_protein, batched=False,
            remove_columns=["prot_seq"],
            desc="Running tokenize_proteins on dataset",
        )

    labels = load_dataset("mila-intel/subloc_template", cache_dir="~/.cache/huggingface/datasets")["train"]["name"]

    text_tokenizer.encode(labels[0], max_length=128, truncation=True, add_special_tokens=False)  # return value is unused
    label_feature = label_embedding(labels, text_tokenizer, text_model, device)
    zero_shot_eval(logger, device, test_dataset, "localization",
                   protein_model, logit_scale, label_feature)

To run ProtST on CPU with optimum-intel and Intel Extension for PyTorch (IPEX) optimizations instead, keep the CPU device and change the model-loading section as follows (again shown as a diff, with + marking the added lines).

...
    protst_model = AutoModel.from_pretrained("mila-intel/ProtST-esm1b", trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
    protein_model = protst_model.protein_model
+   import intel_extension_for_pytorch as ipex
+   from optimum.intel.generation.modeling import jit_trace
+   # apply IPEX bf16 inference optimizations, then JIT-trace the protein encoder for faster CPU inference
+   protein_model = ipex.optimize(protein_model, dtype=torch.bfloat16, inplace=True)
+   protein_model = jit_trace(protein_model, "sequence-classification")
...
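
If only the ipex.optimize step is applied (skipping jit_trace), bf16 inference on CPU is typically wrapped in an autocast context. The snippet below is an illustrative sketch under that assumption, reusing the tensor names from the script above; it is not part of the original example.

    # illustrative: run the IPEX-optimized protein encoder under CPU bf16 autocast
    with torch.inference_mode(), torch.autocast("cpu", dtype=torch.bfloat16):
        protein_outputs = protein_model(protein_input_ids, attention_mask=attention_mask)
        protein_feature = protein_outputs["protein_feature"]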

Runs of mila-intel ProtST-esm1b on huggingface.co

Total runs: 243
24-hour runs: 2
3-day runs: 8
7-day runs: 197
30-day runs: 133

More Information About the ProtST-esm1b Model on huggingface.co

ProtST-esm1b on huggingface.co

ProtST-esm1b is an AI model hosted on huggingface.co and can be used directly from the mila-intel/ProtST-esm1b repository. huggingface.co supports a free trial of the model and also offers paid usage, and the model can be called through an API from Node.js, Python, or plain HTTP.
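
For example, a plain-HTTP call from Python might look like the sketch below. This assumes the model is served by the hosted Inference API (not guaranteed for models that require trust_remote_code); the token and the input sequence are placeholders.

import requests

# Hypothetical call to the hosted Inference API; API_URL follows the standard
# https://api-inference.huggingface.co/models/<repo> pattern and HF_TOKEN is a placeholder.
API_URL = "https://api-inference.huggingface.co/models/mila-intel/ProtST-esm1b"
headers = {"Authorization": "Bearer HF_TOKEN"}

response = requests.post(API_URL, headers=headers, json={"inputs": "<your protein sequence>"})
print(response.json())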

mila-intel ProtST-esm1b free online use

huggingface.co serves as an online trial and API platform for ProtST-esm1b: it integrates the model, exposes API services, and provides a free online trial. You can try ProtST-esm1b online for free via the link below.

Free online URL for mila-intel ProtST-esm1b on huggingface.co:

https://huggingface.co/mila-intel/ProtST-esm1b

ProtST-esm1b installation

ProtST-esm1b is an open-source model whose source code is available on GitHub, where any user can obtain and install it. At the same time, huggingface.co hosts the model so users can try and debug it directly, and it can also be installed and called for free through the API.
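
For local installation, install torch, transformers, and datasets with pip, then load the model and tokenizers exactly as in the example script above:

from transformers import AutoModel, AutoTokenizer

# the custom ProtST model class is downloaded from the Hub, hence trust_remote_code=True
model = AutoModel.from_pretrained("mila-intel/ProtST-esm1b", trust_remote_code=True)
protein_tokenizer = AutoTokenizer.from_pretrained("facebook/esm1b_t33_650M_UR50S")
text_tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")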

ProtST-esm1b installation URL on huggingface.co:

https://huggingface.co/mila-intel/ProtST-esm1b

URL of ProtST-esm1b

https://huggingface.co/mila-intel/ProtST-esm1b

Provider of ProtST-esm1b on huggingface.co

mila-intel