An official quantization of [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) using PV-Tuning on top of AQLM.
For this quantization, we used 1 codebook of 16 bits for groups of 16 weights.
The 1x16g16 models require the `aqlm` inference library v1.1.6 or newer:

```bash
pip install "aqlm[gpu,cpu]>=1.1.6"
```
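
Once `aqlm` is installed, the checkpoint loads like any other `transformers` model. The snippet below is a minimal sketch (not part of the original card); the repo id is a placeholder to be replaced with this repository's actual id:

```python
# Minimal loading sketch; the repo id below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder: the 1x16g16 PV-Tuned Llama-3-8B repository

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the 16-bit embeddings/logits as stored
    device_map="auto",    # place the quantized weights on the available GPU(s)
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```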
Note that a large portion of this model consists of the 16-bit embedding and logits matrices. You can significantly reduce the model footprint by quantizing these matrices as well, e.g. using bitsandbytes LLM.int8 or NF4 formats. This does not require additional training.
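
As one hedged way to apply that note (a sketch of one option, not the procedure used for the published checkpoints), the 16-bit `lm_head` can be swapped for a bitsandbytes LLM.int8 layer after loading; the input embedding matrix can be treated analogously. The `model` object is assumed to come from the loading sketch above, and a CUDA GPU is assumed:

```python
# Sketch only: quantize the 16-bit logits projection with bitsandbytes LLM.int8.
import bitsandbytes as bnb

old_head = model.lm_head  # 16-bit logits projection, left unquantized by AQLM
int8_head = bnb.nn.Linear8bitLt(
    old_head.in_features,
    old_head.out_features,
    bias=old_head.bias is not None,
    has_fp16_weights=False,  # keep the weight in int8 at inference time
    threshold=6.0,           # standard LLM.int8 outlier threshold
)
int8_head.load_state_dict(old_head.state_dict())
model.lm_head = int8_head.to("cuda")  # quantization to int8 happens on the move to the GPU
```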
| Model | AQLM scheme | WikiText-2 PPL | Model size, GB | Hub link |
|---|---|---|---|---|
| meta-llama/Meta-Llama-3-8B | 1x16g8 | 6.99 | 4.1 | Link |
| meta-llama/Meta-Llama-3-8B (this) | 1x16g16 | 9.43 | 3.9 | Link |
| meta-llama/Meta-Llama-3-70B | 1x16g8 | 4.57 | 21.9 | Link |
To learn more about inference, as well as how to quantize models yourself, please refer to the official GitHub repo. The original code for PV-Tuning can be found in the AQLM@pv-tuning branch.