CodeT5+ is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e., encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks. It is introduced in the paper "CodeT5+: Open Code Large Language Models for Code Understanding and Generation".
Compared to the original CodeT5 family (base: 220M, large: 770M), CodeT5+ is pretrained with a diverse set of pretraining tasks, including span denoising, causal language modeling, contrastive learning, and text-code matching, to learn rich representations from both unimodal code data and bimodal code-text data.
Additionally, it employs a simple yet effective compute-efficient pretraining method to initialize the model components with frozen off-the-shelf LLMs such as CodeGen to efficiently scale up the model (i.e., 2B, 6B, 16B), and adopts a "shallow encoder and deep decoder" architecture. Furthermore, it is instruction-tuned to align with natural language instructions (see our InstructCodeT5+ 16B) following Code Alpaca.
How to use
This model can be easily loaded using the T5ForConditionalGeneration functionality and employs the same tokenizer as the original CodeT5.
from transformers import T5ForConditionalGeneration, AutoTokenizer
checkpoint = "Salesforce/codet5p-770m-py"
device = "cuda"# for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# ==> print('Hello World!')
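
The introduction above notes that CodeT5+ can also operate in an encoder-only mode. As a minimal sketch of that idea (not part of the official usage example; the mean-pooling step is an illustrative assumption), the encoder of the same checkpoint can be called directly to obtain code embeddings:

import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Salesforce/codet5p-770m-py"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

inputs = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt").to(device)
with torch.no_grad():
    # Run only the encoder stack; output shape is (batch, seq_len, hidden_size).
    encoder_outputs = model.encoder(**inputs)

# Mean-pool token states into one vector per snippet (illustrative pooling choice).
embedding = encoder_outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)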
Pretraining data
This checkpoint is trained on the stricter permissive subset of the deduplicated version of the github-code dataset.
The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause", "cc0-1.0", "unlicense", "isc").
Supported languages (9 in total) are as follows: c, c++, c-sharp, go, java, javascript, php, python, ruby.
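
As a hedged sketch of the kind of filtering described above (the dataset id, field names, and value spellings are assumptions about the github-code dataset, not details from this model card), the datasets library could be used roughly like this:

from datasets import load_dataset

# License strings follow the list above; exact spellings in the dataset may differ.
PERMISSIVE = {"mit", "apache-2", "bsd-3-clause", "bsd-2-clause", "cc0-1.0", "unlicense", "isc"}
LANGUAGES = {"C", "C++", "C#", "GO", "Java", "JavaScript", "PHP", "Python", "Ruby"}

ds = load_dataset("codeparrot/github-code", split="train", streaming=True)
filtered = ds.filter(lambda ex: ex["license"] in PERMISSIVE and ex["language"] in LANGUAGES)

for example in filtered.take(3):
    print(example["repo_name"], example["language"], example["license"])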
Training procedure
In the first stage of pretraining, this checkpoint is trained on multilingual unimodal code data with a diverse set of pretraining tasks, including span denoising and two variants of causal language modeling. After that, it is further trained on the Python subset with the causal language modeling objective for another epoch to better adapt it to Python code generation. Please refer to the paper for more details.
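
As a toy illustration of how these two objective families differ (this is not the actual pretraining pipeline; the sentinel-token format, masked span, and split point below are assumptions for illustration only):

# Span denoising (T5-style): mask a span in the input and reconstruct it in the target.
denoise_input = "def print_hello_world():\n    <extra_id_0>('Hello World!')\n"
denoise_target = "<extra_id_0> print"

# Causal LM variant: split the snippet and train the model to continue it.
code = "def print_hello_world():\n    print('Hello World!')\n"
split = code.index("\n") + 1
clm_input, clm_target = code[:split], code[split:]

print((denoise_input, denoise_target))
print((clm_input, clm_target))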
Evaluation results
CodeT5+ models have been comprehensively evaluated on a wide range of code understanding and generation tasks in various settings: zero-shot, finetuning, and instruction-tuning.
Specifically, CodeT5+ yields substantial performance gains over SoTA baselines on many downstream tasks, e.g., 8 text-to-code retrieval tasks (+3.2 avg. MRR), 2 line-level code completion tasks (+2.1 avg. Exact Match), and 2 retrieval-augmented code generation tasks (+5.8 avg. BLEU-4).
On 2 math programming tasks, MathQA-Python and GSM8K-Python, CodeT5+ models below billion-parameter scale significantly outperform many LLMs of up to 137B parameters.
In particular, on the zero-shot text-to-code generation task on the HumanEval benchmark, InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 among open code LLMs, even surpassing the closed-source OpenAI code-cushman-001 model.
Please refer to the paper for more details.
Specifically for this checkpoint, it achieves 15.5% pass@1 on HumanEval in the zero-shot setting, which is comparable to much larger LLMs such as InCoder 6B's 15.2%, GPT-NeoX 20B's 15.4%, and PaLM 62B's 15.9%.
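
For reference, pass@1 and pass@10 on HumanEval are typically computed with the unbiased pass@k estimator from the standard HumanEval evaluation protocol. The sketch below restates that estimator; it is not code from the CodeT5+ release, and the sample counts are made up for illustration:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k),
    # given n generated samples of which c pass the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers only: 200 samples per problem, 31 of them passing the tests.
print(pass_at_k(200, 31, 1), pass_at_k(200, 31, 10))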
BibTeX entry and citation info
@article{wang2023codet5plus,
title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
journal={arXiv preprint},
year={2023}
}