We are excited to announce the continuation and rebranding of our BLIP series into XGen-MM, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
XGen-MM is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series builds on the successful designs of the BLIP series, incorporating fundamental enhancements that ensure a more robust and superior foundation.
These models have been trained at scale on high-quality image caption datasets and interleaved image-text data. Key features of XGen-MM:
- The pretrained foundation model, xgen-mm-phi3-mini-base-r-v1, achieves state-of-the-art performance among models under 5B parameters and demonstrates strong in-context learning capabilities.
- The instruct fine-tuned model, xgen-mm-phi3-mini-instruct-r-v1, achieves state-of-the-art performance among open-source and closed-source VLMs under 5B parameters.
Instruction-tuning data: LLaVA-Instruct-150K, ShareGPT4V captions, a mixture of academic VQA data including OCR/Document/Chart-focused tasks, and publicly available text-only instruction data.
Results
Pretrain (base model without instruction tuning)
| Model | Shot | COCO (val) | NoCaps (val) | TextCaps (val) | OKVQA (val) | TextVQA (val) | VizWiz (testdev) | VQAv2 (testdev) |
|---|---|---|---|---|---|---|---|---|
| Flamingo-3B | 4 | 85.0 | - | - | 43.3 | 32.7 | 34 | 53.2 |
| | 8 | 90.6 | - | - | 44.6 | 32.4 | 38.4 | 55.4 |
| MM1-3B | 0 | 73.5 | 55.6 | 63.3 | 26.1 | 29.4 | 15.6 | 46.2 |
| | 4 | 112.3 | 99.7 | 84.1 | 48.6 | 45.3 | 38.0 | 57.9 |
| | 8 | 114.6 | 104.7 | 88.8 | 48.4 | 44.6 | 46.4 | 63.6 |
| xgen-mm-phi3-mini-base-r-v1 (Ours) | 0 | 81.7 | 80.2 | 60.7 | 26.5 | 36.0 | 21.2 | 48.1 |
| | 4 | 110.5 | 101.7 | 84.6 | 49.2 | 46.1 | 38.4 | 63.9 |
| | 8 | 112.1 | 104.4 | 87.7 | 49.1 | 46.4 | 44.3 | 63.8 |
Instruct (after instruction tuning)
| Model | SEED-IMG | MMBench(dev) | MME-total | MME-P | MME-C | MMStar | MMMU (val) | MMVet | MathVista (mini) | ScienceQA (test) | POPE | AI2D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MM1-3B-Chat | 68.8 | 67.8 | 1761 | 1482 | 279 | - | 33.9 | 43.7 | - | - | 87.4 | - |
| openbmb/MiniCPM-V-2 | 67.1 | 69.6 | 1808 | - | - | - | 38.2 | - | 38.7 | - | - | - |
| VILA1.5-3B | 67.9 | 63.4 | - | 1442 | - | - | 33.3 | 35.4 | - | 69.0 | 85.9 | - |
| xtuner/llava-phi-3-mini-hf | 70.0 | 69.2 | 1790 | 1477 | 313 | 43.7 | 41.4 | - | - | 73.7 | 87.3 | 69.3 |
| xgen-mm-phi3-mini-instruct-r-v1 (Ours) | 72.1 | 74.1 | 1827 | 1467 | 360 | 44.6 | 39.8 | 45.1 | 39.3 | 74.2 | 87.2 | 75.8 |
How to use
> We require the development version ("4.41.0.dev0") of the transformers library. As of 05/07/2024, it can be installed with:
> pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor, StoppingCriteria
import torch
import requests
from PIL import Image

# define the prompt template
def apply_prompt_template(prompt):
    s = (
        '<|system|>\nA chat between a curious user and an artificial intelligence assistant. '
        "The assistant gives helpful, detailed, and polite answers to the user's questions.<|end|>\n"
        f'<|user|>\n<image>\n{prompt}<|end|>\n<|assistant|>\n'
    )
    return s

# stop generation once the last generated token ids match eos_sequence
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence = [32007]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:,-len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids

# load models
model_name_or_path = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"
model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, use_fast=False, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)

# craft a test sample
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
query = "how many dogs are in the picture?"

model = model.cuda()
inputs = image_processor([raw_image], return_tensors="pt", image_aspect_ratio='anyres')
prompt = apply_prompt_template(query)
language_inputs = tokenizer([prompt], return_tensors="pt")
inputs.update(language_inputs)
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}
generated_text = model.generate(**inputs, image_size=[raw_image.size],
                                pad_token_id=tokenizer.pad_token_id,
                                do_sample=False, max_new_tokens=768, top_p=None, num_beams=1,
                                stopping_criteria = [EosListStoppingCriteria()],
                                )
prediction = tokenizer.decode(generated_text[0], skip_special_tokens=True).split("<|end|>")[0]
print("==> prediction: ", prediction)
# output: ==> prediction: There is one dog in the picture.
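
To make the stopping and post-processing logic in the example above easier to follow, here is a minimal pure-Python sketch of the same two checks. Plain lists stand in for the token-id tensor, and the helper names `ends_with_eos` and `strip_after_end` are hypothetical (they are not part of transformers or of the model's remote code). The token id 32007 and the `<|end|>` marker are taken from the example above.

```python
# Hypothetical helpers mirroring the logic of EosListStoppingCriteria and the
# final split("<|end|>") step, using plain Python lists instead of tensors.

EOS_SEQUENCE = [32007]  # default stop sequence used in the example above


def ends_with_eos(batch_token_ids, eos_sequence=EOS_SEQUENCE):
    """Return True if any sequence in the batch ends with eos_sequence."""
    n = len(eos_sequence)
    # Take the last n ids of every sequence, like input_ids[:, -n:].tolist()
    tails = [ids[-n:] for ids in batch_token_ids]
    return eos_sequence in tails


def strip_after_end(text, end_marker="<|end|>"):
    """Keep only the text before the first end marker, if one is present."""
    return text.split(end_marker)[0]


if __name__ == "__main__":
    print(ends_with_eos([[1, 2, 32007]]))
    print(ends_with_eos([[1, 2, 3]]))
    print(strip_after_end("There is one dog in the picture.<|end|>"))
```

Note that the check returns True as soon as any sequence in the batch ends with the stop sequence, which is harmless for the single-sample example above but is worth keeping in mind when generating for several prompts at once.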
More comprehensive examples can be found in the notebook.
Reproducibility:
Our SFT evaluation is based on the VLMEvalKit, in which we fixed some inconsistencies with the official benchmarks (e.g., LLM judge API). During our development, we noticed that the raw resolution of the input image would noticeably affect the model output in some cases.
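
Because input resolution can change the output, one way to keep comparisons consistent is to fix a resolution policy and apply it to every image before evaluation. The sketch below is an assumption for illustration only, not part of our evaluation pipeline or of VLMEvalKit: the helper `capped_size` and the 1024-pixel cap are hypothetical. It computes a target size that preserves the aspect ratio and could be passed to `Image.resize` in PIL.

```python
# Hypothetical helper: compute a resize target so that every evaluated image
# follows the same resolution policy. The 1024-pixel cap is an arbitrary choice.

def capped_size(size, max_side=1024):
    """Return (width, height) scaled down so the longer side is at most max_side.

    Images that are already small enough are returned unchanged.
    """
    width, height = size
    longest = max(width, height)
    if longest <= max_side:
        return (width, height)
    scale = max_side / longest
    # Round to whole pixels and never go below 1 pixel on either side.
    return (max(1, round(width * scale)), max(1, round(height * scale)))


if __name__ == "__main__":
    # With PIL this would be used as: raw_image.resize(capped_size(raw_image.size))
    print(capped_size((2048, 1024)))
    print(capped_size((640, 480)))
```

Recording the original and resized dimensions alongside each prediction also makes it easier to trace a score difference back to a resolution difference.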
Bias, Risks, Limitations, and Ethical Considerations
The main data sources are from the internet, including webpages, image stock sites, and curated datasets released by the research community. We have excluded certain data, such as LAION, due to known CSAM concerns.
The model may be subject to bias from the original data source, as well as bias from LLMs and commercial APIs.
We strongly recommend users assess safety and fairness before applying to downstream applications.
License
Our code and weights are released under the Creative Commons Attribution Non Commercial 4.0 LICENSE. Please fill out the form here to inquire about commercial use of the model weights.