The NVIDIA Llama 3.1 70B Instruct FP8 model is the quantized version of the Meta's Llama 3.1 70B Instruct model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check
here
. The NVIDIA Llama 3.1 70B Instruct FP8 model is quantized with
TensorRT Model Optimizer
.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA
(Meta-Llama-3.1-70B-Instruct) Model Card
.
Engine:
Tensor(RT)-LLM or vLLM
Test Hardware:
H100
Post Training Quantization
This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-70B-Instruct to FP8 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformers blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, we achieved 1.5x speedup.
Usage
Deploy with TensorRT-LLM
To deploy the quantized checkpoint with
TensorRT-LLM
, follow the sample commands below with the TensorRT-LLM GitHub repo:
The accuracy (MMLU, 5-shot) and throughputs (tokens per second, TPS) benchmark results are presented in the table below:
Precision
MMLU
TPS
FP16
82.5
1356.92
FP8
82.3
2040.30
We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughputs with in-flight batching enabled. We achieved **~1.5x** speedup with FP8.
Deploy with vLLM
To deploy the quantized checkpoint with
vLLM
, follow the instructions below:
To use a Model Optimizer PTQ checkpoint with vLLM,
quantization=modelopt
flag must be passed into the config while initializing the
LLM
Engine.
Example deployment on an H100:
from vllm import LLM, SamplingParams
model_id = "nvidia/Llama-3.1-70B-Instruct-FP8"
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
llm = LLM(model=model_id, quantization="modelopt")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This model can be deployed with an OpenAI Compatible Server via the vLLM backend. Instructions
here
.
Runs of nvidia Llama-3.1-70B-Instruct-FP8 on huggingface.co
816
Total runs
18
24-hour runs
17
3-day runs
-93
7-day runs
509
30-day runs
More Information About Llama-3.1-70B-Instruct-FP8 huggingface.co Model
Llama-3.1-70B-Instruct-FP8 huggingface.co
Llama-3.1-70B-Instruct-FP8 huggingface.co is an AI model on huggingface.co that provides Llama-3.1-70B-Instruct-FP8's model effect (), which can be used instantly with this nvidia Llama-3.1-70B-Instruct-FP8 model. huggingface.co supports a free trial of the Llama-3.1-70B-Instruct-FP8 model, and also provides paid use of the Llama-3.1-70B-Instruct-FP8. Support call Llama-3.1-70B-Instruct-FP8 model through api, including Node.js, Python, http.
Llama-3.1-70B-Instruct-FP8 huggingface.co is an online trial and call api platform, which integrates Llama-3.1-70B-Instruct-FP8's modeling effects, including api services, and provides a free online trial of Llama-3.1-70B-Instruct-FP8, you can try Llama-3.1-70B-Instruct-FP8 online for free by clicking the link below.
nvidia Llama-3.1-70B-Instruct-FP8 online free url in huggingface.co:
Llama-3.1-70B-Instruct-FP8 is an open source model from GitHub that offers a free installation service, and any user can find Llama-3.1-70B-Instruct-FP8 on GitHub to install. At the same time, huggingface.co provides the effect of Llama-3.1-70B-Instruct-FP8 install, users can directly use Llama-3.1-70B-Instruct-FP8 installed effect in huggingface.co for debugging and trial. It also supports api for free installation.
Llama-3.1-70B-Instruct-FP8 install url in huggingface.co: