nvidia / Llama-3.1-70B-Instruct-FP8

huggingface.co
Total runs: 816
24-hour runs: 18
7-day runs: -93
30-day runs: 509
Model's Last Updated: January 10 2025

Introduction of Llama-3.1-70B-Instruct-FP8

Model Details of Llama-3.1-70B-Instruct-FP8

Model Overview

Description:

The NVIDIA Llama 3.1 70B Instruct FP8 model is the quantized version of the Meta's Llama 3.1 70B Instruct model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here . The NVIDIA Llama 3.1 70B Instruct FP8 model is quantized with TensorRT Model Optimizer .

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA (Meta-Llama-3.1-70B-Instruct) Model Card .

License/Terms of Use:

nvidia-open-model-license

Model Architecture:

Architecture Type: Transformers
Network Architecture: Llama3.1

Input:

Input Type(s): Text
Input Format(s): String
Input Parameters: Sequences
Other Properties Related to Input: Context length up to 128K

Output:

Output Type(s): Text
Output Format: String
Output Parameters: Sequences
Other Properties Related to Output: N/A

Software Integration:

Supported Runtime Engine(s):

  • Tensor(RT)-LLM
  • vLLM

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Lovelace

Preferred Operating System(s):

  • Linux
Model Version(s):

The model is quantized with nvidia-modelopt v0.15.1

Datasets:
Inference:

Engine: Tensor(RT)-LLM or vLLM
Test Hardware: H100

Post Training Quantization

This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-70B-Instruct to FP8 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within transformers blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, we achieved 1.5x speedup.

Usage
Deploy with TensorRT-LLM

To deploy the quantized checkpoint with TensorRT-LLM , follow the sample commands below with the TensorRT-LLM GitHub repo:

  • Checkpoint convertion:
python examples/llama/convert_checkpoint.py --model_dir Llama-3.1-70B-Instruct-FP8 --output_dir /ckpt --use_fp8
  • Build engines:
trtllm-build --checkpoint_dir /ckpt --output_dir /engine
  • Accuracy evaluation:
  1. Prepare the MMLU dataset:
mkdir data; wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
  1. Measure MMLU:
python examples/mmlu.py --engine_dir ./engine --tokenizer_dir Llama-3.1-70B-Instruct-FP8/ --test_trt_llm --data_dir data/mmlu
  • Throughputs evaluation:

Please refer to the TensorRT-LLM benchmarking documentation for details.

Evaluation

The accuracy (MMLU, 5-shot) and throughputs (tokens per second, TPS) benchmark results are presented in the table below:

Precision MMLU TPS
FP16 82.5 1356.92
FP8 82.3 2040.30
We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughputs with in-flight batching enabled. We achieved **~1.5x** speedup with FP8.
Deploy with vLLM

To deploy the quantized checkpoint with vLLM , follow the instructions below:

  1. Install vLLM from directions here .
  2. To use a Model Optimizer PTQ checkpoint with vLLM, quantization=modelopt flag must be passed into the config while initializing the LLM Engine.

Example deployment on an H100:

from vllm import LLM, SamplingParams

model_id = "nvidia/Llama-3.1-70B-Instruct-FP8"
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

llm = LLM(model=model_id, quantization="modelopt")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

This model can be deployed with an OpenAI Compatible Server via the vLLM backend. Instructions here .

Runs of nvidia Llama-3.1-70B-Instruct-FP8 on huggingface.co

816
Total runs
18
24-hour runs
17
3-day runs
-93
7-day runs
509
30-day runs

More Information About Llama-3.1-70B-Instruct-FP8 huggingface.co Model

Llama-3.1-70B-Instruct-FP8 huggingface.co

Llama-3.1-70B-Instruct-FP8 huggingface.co is an AI model on huggingface.co that provides Llama-3.1-70B-Instruct-FP8's model effect (), which can be used instantly with this nvidia Llama-3.1-70B-Instruct-FP8 model. huggingface.co supports a free trial of the Llama-3.1-70B-Instruct-FP8 model, and also provides paid use of the Llama-3.1-70B-Instruct-FP8. Support call Llama-3.1-70B-Instruct-FP8 model through api, including Node.js, Python, http.

Llama-3.1-70B-Instruct-FP8 huggingface.co Url

https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8

nvidia Llama-3.1-70B-Instruct-FP8 online free

Llama-3.1-70B-Instruct-FP8 huggingface.co is an online trial and call api platform, which integrates Llama-3.1-70B-Instruct-FP8's modeling effects, including api services, and provides a free online trial of Llama-3.1-70B-Instruct-FP8, you can try Llama-3.1-70B-Instruct-FP8 online for free by clicking the link below.

nvidia Llama-3.1-70B-Instruct-FP8 online free url in huggingface.co:

https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8

Llama-3.1-70B-Instruct-FP8 install

Llama-3.1-70B-Instruct-FP8 is an open source model from GitHub that offers a free installation service, and any user can find Llama-3.1-70B-Instruct-FP8 on GitHub to install. At the same time, huggingface.co provides the effect of Llama-3.1-70B-Instruct-FP8 install, users can directly use Llama-3.1-70B-Instruct-FP8 installed effect in huggingface.co for debugging and trial. It also supports api for free installation.

Llama-3.1-70B-Instruct-FP8 install url in huggingface.co:

https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8

Url of Llama-3.1-70B-Instruct-FP8

Llama-3.1-70B-Instruct-FP8 huggingface.co Url

Provider of Llama-3.1-70B-Instruct-FP8 huggingface.co

nvidia
ORGANIZATIONS

Other API from nvidia

huggingface.co

Total runs: 273.2K
Run Growth: 58.7K
Growth Rate: 23.39%
Updated: November 30 2024
huggingface.co

Total runs: 49.4K
Run Growth: 41.0K
Growth Rate: 85.79%
Updated: January 14 2025
huggingface.co

Total runs: 43.9K
Run Growth: -119.7K
Growth Rate: -284.27%
Updated: November 15 2023
huggingface.co

Total runs: 37.2K
Run Growth: 3.6K
Growth Rate: 9.10%
Updated: August 06 2022
huggingface.co

Total runs: 20.6K
Run Growth: 1.3K
Growth Rate: 5.90%
Updated: August 06 2022
huggingface.co

Total runs: 20.3K
Run Growth: 1.6K
Growth Rate: 7.71%
Updated: August 06 2022
huggingface.co

Total runs: 17.3K
Run Growth: 7.5K
Growth Rate: 44.69%
Updated: May 08 2024
huggingface.co

Total runs: 9.2K
Run Growth: 6.3K
Growth Rate: 69.92%
Updated: August 06 2022
huggingface.co

Total runs: 5.2K
Run Growth: -1.5K
Growth Rate: -26.67%
Updated: November 30 2024
huggingface.co

Total runs: 3.4K
Run Growth: 3.2K
Growth Rate: 91.94%
Updated: January 28 2025
huggingface.co

Total runs: 3.2K
Run Growth: 3.0K
Growth Rate: 95.56%
Updated: January 28 2025
huggingface.co

Total runs: 2.5K
Run Growth: 844
Growth Rate: 35.96%
Updated: August 06 2022
huggingface.co

Total runs: 1.9K
Run Growth: 520
Growth Rate: 27.72%
Updated: December 18 2024
huggingface.co

Total runs: 1.2K
Run Growth: 57
Growth Rate: 5.14%
Updated: November 06 2024
huggingface.co

Total runs: 981
Run Growth: -979
Growth Rate: -85.80%
Updated: December 02 2024
huggingface.co

Total runs: 846
Run Growth: 99
Growth Rate: 11.99%
Updated: December 10 2024
huggingface.co

Total runs: 807
Run Growth: 776
Growth Rate: 96.64%
Updated: January 28 2025