Meta-Llama-3.1-8B-Instruct-FP8-dynamic
Model Overview
Model Architecture: Meta-Llama-3.1
Input: Text
Output: Text
Model Optimizations:
Weight quantization: FP8
Activation quantization: FP8
Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
Quantized version of Meta-Llama-3.1-8B-Instruct. It achieves an average score of 73.81 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 74.17.
Model Optimizations
This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to the FP8 data type, ready for inference with vLLM built from source.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
Only the weights and activations of the linear operators within transformer blocks are quantized. Weights use symmetric per-channel quantization, in which one linear scale per output dimension maps the weights to their FP8 representation; activations are quantized dynamically, with a scale computed per token at inference time.
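To make this scheme concrete, the following is a minimal PyTorch sketch of the arithmetic described above. It is an illustration only, not the LLM Compressor or vLLM implementation; the helper names are invented here, and 448 is the largest magnitude representable in the FP8 E4M3 format.

import torch

# Illustrative only: symmetric FP8 scaling as described above, not the
# actual quantization kernels. 448 is the max magnitude of FP8 E4M3.
FP8_MAX = 448.0

def quantize_weight_per_channel(w: torch.Tensor):
    # One scale per output channel (row of the linear layer's weight matrix).
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def quantize_activation_per_token(x: torch.Tensor):
    # "Dynamic" means these scales are computed at inference time, one per token.
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

w = torch.randn(4096, 4096)   # linear weight: (out_features, in_features)
x = torch.randn(8, 4096)      # activations for 8 tokens
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)

# Dequantizing and multiplying approximates the original x @ w.T.
approx = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T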
LLM Compressor is used for quantization with 512 sequences of UltraChat.
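For reference, the following is a minimal sketch of producing an FP8-dynamic checkpoint with LLM Compressor. The FP8_DYNAMIC scheme and the unquantized lm_head are assumptions based on the library's documented presets; the exact recipe Neural Magic used may differ, and import paths vary across LLM Compressor versions.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 weights (symmetric per-channel) and FP8 activations (per-token dynamic);
# the lm_head is assumed to stay in its original precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "Meta-Llama-3.1-8B-Instruct-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)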
Deployment
Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template into a prompt string, appending the assistant
# header so the model generates a reply rather than continuing the user turn.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
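As a sketch, once a server is running (started, e.g., with vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic, which listens on port 8000 by default), it can be queried with the standard OpenAI Python client. The base URL and placeholder API key below reflect vLLM's defaults and are assumptions:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
# (default base URL; vLLM accepts any API key unless one is configured).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(response.choices[0].message.content)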
Evaluation
The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA. Evaluation was conducted using the Neural Magic fork of lm-evaluation-harness (branch llama_3.1_instruct) and the vLLM engine. This version of lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals.
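The following is a minimal sketch of reproducing such an evaluation with the upstream lm-evaluation-harness Python API on the vLLM backend. The task names and model arguments are assumptions based on the upstream harness; the fork's llama_3.1_instruct branch may define its own task variants:

import lm_eval

# Hedged sketch: upstream lm-evaluation-harness task names are assumed here;
# the Neural Magic fork may use different task definitions and prompts.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic,dtype=auto,max_model_len=4096",
    tasks=["mmlu", "arc_challenge", "gsm8k", "hellaswag", "winogrande", "truthfulqa_mc2"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)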