This model is built iteratively starting from an off-the-shelf SigLIP model. We finetuned it to create BiSigLIP and fed the patch-embeddings output by SigLIP to an LLM, PaliGemma-3B, to create BiPali.
One benefit of inputting image patch embeddings through a language model is that they are natively mapped to a latent space similar to textual input (query).
This makes it possible to use the ColBERT strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.
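The ColBERT-style late interaction mentioned above can be illustrated with a minimal sketch. It assumes the commonly described MaxSim formulation: each query token keeps the similarity of its best-matching image patch, and these maxima are summed. The function names and the toy 2-dimensional embeddings are hypothetical and are not taken from the ColPali codebase.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """MaxSim: for each query token, take its highest similarity over all
    page patches, then sum these maxima over the query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy embeddings with made-up values, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a good match for both query tokens
page_b = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first query token well

score_a = late_interaction_score(query, page_a)
score_b = late_interaction_score(query, page_b)
print(score_a, score_b)
```

In this toy case page_a scores higher than page_b because every query token finds a close patch, whereas a single pooled vector per page would blur that token-level evidence.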
Model Training
Dataset
Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset (37%) made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions.
Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set, to prevent evaluation contamination.
A validation set is created with 2% of the samples to tune hyperparameters.
Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.
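As a rough illustration of the proportions above, the following sketch converts the reported percentages into approximate pair counts. The percentages in the text are rounded, and it is an assumption here that the 2% validation split is taken from the full 127,460 pairs, so the results are estimates rather than official figures.

```python
TOTAL_PAIRS = 127_460     # query-page pairs, as stated in the text
ACADEMIC_SHARE = 0.63     # rounded share of academic train sets
VALIDATION_SHARE = 0.02   # share held out for hyperparameter tuning


def approx_count(total: int, share: float) -> int:
    """Approximate number of samples for a rounded percentage share."""
    return round(total * share)


academic = approx_count(TOTAL_PAIRS, ACADEMIC_SHARE)
synthetic = TOTAL_PAIRS - academic  # the remaining ~37%
validation = approx_count(TOTAL_PAIRS, VALIDATION_SHARE)

print(f"academic ~{academic}, synthetic ~{synthetic}, validation ~{validation}")
```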
Parameters
All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in bfloat16 format, use low-rank adapters (LoRA) with alpha=32 and r=32 on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a paged_adamw_8bit optimizer.
We train on an 8-GPU setup with data parallelism, using a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
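The learning-rate schedule described above can be sketched as follows. This is a minimal illustration that assumes the usual shape of a linear schedule (linear ramp to the peak over the warmup steps, then linear decay to zero); the function name is hypothetical, and the total of 1,000 steps is made up for the example because the text does not state the number of optimizer steps.

```python
def linear_warmup_decay_lr(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-5,
    warmup_ratio: float = 0.025,
) -> float:
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from peak_lr to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length of 1,000 steps, so warmup covers the first 25 steps.
for s in (0, 12, 25, 500, 1000):
    print(s, linear_warmup_decay_lr(s, total_steps=1000))
```

With these assumed values the rate peaks at step 25 and reaches zero at step 1,000.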
Usage
import torch
import typer
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image
from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_from_page_utils import load_from_dataset
def main() -> None:
    """Example script to run inference with ColPali"""

    # Load the base model, then attach the ColPali adapter weights
    model_name = "vidore/colpali"
    model = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=torch.bfloat16, device_map="cuda").eval()
    model.load_adapter(model_name)
    processor = AutoProcessor.from_pretrained(model_name)

    # select images -> load_from_pdf(<pdf_path>), load_from_image_urls(["<url_1>"]), load_from_dataset(<path>)
    images = load_from_dataset("vidore/docvqa_test_subsampled")
    queries = ["From which university does James V. Fiorca come ?", "Who is the japanese prime minister?"]

    # run inference - docs
    dataloader = DataLoader(
        images,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_images(processor, x),
    )
    ds = []
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
            embeddings_doc = model(**batch_doc)
        # Split the batch into one multi-vector embedding per page
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

    # run inference - queries (a blank white image is passed alongside each text query)
    dataloader = DataLoader(
        queries,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_queries(processor, x, Image.new("RGB", (448, 448), (255, 255, 255))),
    )
    qs = []
    for batch_query in dataloader:
        with torch.no_grad():
            batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
            embeddings_query = model(**batch_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # run evaluation: score every query against every page, then print the best page index per query
    retriever_evaluator = CustomEvaluator(is_multi_vector=True)
    scores = retriever_evaluator.evaluate(qs, ds)
    print(scores.argmax(axis=1))


if __name__ == "__main__":
    typer.run(main)
Limitations
Focus: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
Support: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.
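One possible way to work with a single-vector retrieval framework, sketched below, is a two-stage search: mean-pool each multi-vector embedding into one vector for candidate generation, then rerank the candidates with late-interaction scoring. This is a generic workaround and not something described or evaluated in this model card; mean pooling discards token-level information and may reduce recall. All names and the toy embeddings are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
MultiVector = List[Vector]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors: MultiVector) -> List[float]:
    """Collapse a multi-vector embedding into a single averaged vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def maxsim(query: MultiVector, page: MultiVector) -> float:
    """Late-interaction score: sum over query tokens of the best patch similarity."""
    return sum(max(dot(q, p) for p in page) for q in query)


def two_stage_search(
    query: MultiVector, pages: Dict[str, MultiVector], n_candidates: int, top_k: int
) -> List[str]:
    # Stage 1: candidate generation with pooled vectors. In a real system this
    # lookup would be served by the single-vector index.
    q_pooled = mean_pool(query)
    pooled = {pid: mean_pool(vecs) for pid, vecs in pages.items()}
    candidates = sorted(pages, key=lambda pid: dot(q_pooled, pooled[pid]), reverse=True)[:n_candidates]
    # Stage 2: rerank the shortlisted pages with the full multi-vector score.
    return sorted(candidates, key=lambda pid: maxsim(query, pages[pid]), reverse=True)[:top_k]


# Toy embeddings with made-up values, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.7, 0.7]],
    "c": [[-1.0, 0.0], [0.0, -1.0]],
}
result = two_stage_search(query, pages, n_candidates=2, top_k=2)
print(result)
```

In this toy case the pooled stage ranks page "b" first, and the reranking stage moves page "a" to the top because each query token has an exact patch match there.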
License
ColPali's vision language backbone model (PaliGemma) is under the gemma license as specified in its model card. The adapters attached to the model are under the MIT license.
If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}