florence-2-base replicate.com api & lucataco florence-2-base github AI Model

Introduction of florence-2-base

Model Details of florence-2-base

Readme

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Model Summary

This Hub repository contains a Hugging Face transformers implementation of the Florence-2 model from Microsoft.

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages our FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model’s sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.

Resources and Technical Documentation:
+ Florence-2 technical report
+ Jupyter Notebook for inference and visualization of the Florence-2-large model

Model	Model size	Model Description
Florence-2-base [HF]	0.23B	Pretrained model with FLD-5B
Florence-2-large [HF]	0.77B	Pretrained model with FLD-5B
Florence-2-base-ft [HF]	0.23B	Finetuned model on a collection of downstream tasks
Florence-2-large-ft [HF]	0.77B	Finetuned model on a collection of downstream tasks

Tasks

This model performs different tasks depending on the task prompt it is given.
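The task snippets on this page call a `run_example` helper that the page itself never defines. The following is a minimal sketch of such a helper, assuming the usual Hugging Face transformers flow for Florence-2 (a processor that builds the inputs, `model.generate`, then the processor's `post_process_generation`). The names `build_prompt`, `make_run_example`, and `load_florence2` are illustrative and not part of the original page. The generation settings (`max_new_tokens=1024`, `num_beams=3`) are assumed defaults, not requirements.

```python
def build_prompt(task_prompt, text_input=None):
    """Concatenate the task token with optional extra text (e.g. a caption)."""
    return task_prompt if text_input is None else task_prompt + text_input


def make_run_example(model, processor, image):
    """Return a run_example(task_prompt, text_input=None) bound to one image.

    `model` and `processor` are expected to behave like the Florence-2
    objects loaded through transformers; `image` needs .width and .height
    (a PIL image does).
    """
    def run_example(task_prompt, text_input=None):
        prompt = build_prompt(task_prompt, text_input)
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,  # assumed setting
            num_beams=3,          # assumed setting
        )
        # Keep special tokens: the post-processor needs the location tokens.
        generated_text = processor.batch_decode(
            generated_ids, skip_special_tokens=False
        )[0]
        return processor.post_process_generation(
            generated_text,
            task=task_prompt,
            image_size=(image.width, image.height),
        )

    return run_example


def load_florence2(model_id="microsoft/Florence-2-base"):
    """Load model and processor (downloads weights; needs transformers)."""
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor


# Possible usage (the image path is hypothetical):
#   from PIL import Image
#   model, processor = load_florence2()
#   run_example = make_run_example(model, processor, Image.open("car.jpg"))
#   print(run_example("<CAPTION>"))
```

Binding the model, processor, and image once keeps the call sites identical to the short `run_example(prompt)` form used in the task snippets.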

Florence-2 can perform the following tasks:

Caption

prompt = "<CAPTION>"
run_example(prompt)

Detailed Caption

prompt = "<DETAILED_CAPTION>"
run_example(prompt)

More Detailed Caption

prompt = "<MORE_DETAILED_CAPTION>"
run_example(prompt)

Caption to Phrase Grounding

The caption to phrase grounding task requires an additional text input, i.e. the caption.

Caption to phrase grounding results format: {'<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], …], 'labels': ['', '', …]}}

task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
results = run_example(task_prompt, text_input="A green car parked in front of a yellow building.")

Object Detection

OD results format: {'<OD>': {'bboxes': [[x1, y1, x2, y2], …], 'labels': ['label1', 'label2', …]}}

prompt = "<OD>"
run_example(prompt)
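The OD result format above stores boxes and labels as two parallel lists, so entry i of `labels` describes entry i of `bboxes`. A minimal sketch of turning such a result into per-object records follows. The helper names `pair_detections` and `box_area` and the sample values are made up for illustration and are not model output.

```python
def pair_detections(result, task="<OD>"):
    """Zip the parallel 'bboxes' and 'labels' lists into per-object records."""
    payload = result[task]
    bboxes, labels = payload["bboxes"], payload["labels"]
    if len(bboxes) != len(labels):
        raise ValueError("bboxes and labels must have the same length")
    return [{"label": label, "box": tuple(box)} for label, box in zip(labels, bboxes)]


def box_area(box):
    """Area of an [x1, y1, x2, y2] box; degenerate boxes count as 0."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


# Illustrative result in the documented format (values are made up).
sample = {"<OD>": {"bboxes": [[0, 0, 10, 20], [5, 5, 15, 10]],
                   "labels": ["car", "door"]}}

for det in pair_detections(sample):
    print(det["label"], det["box"], box_area(det["box"]))
```

The same helper applies to the other tasks that return `bboxes` and `labels` by passing their task token as `task`.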

Dense Region Caption

Dense region caption results format: {'<DENSE_REGION_CAPTION>': {'bboxes': [[x1, y1, x2, y2], …], 'labels': ['label1', 'label2', …]}}

prompt = "<DENSE_REGION_CAPTION>"
run_example(prompt)

Region Proposal

Region proposal results format: {'<REGION_PROPOSAL>': {'bboxes': [[x1, y1, x2, y2], …], 'labels': ['', '', …]}}

prompt = "<REGION_PROPOSAL>"
run_example(prompt)

OCR

prompt = "<OCR>"
run_example(prompt)

OCR with Region

OCR with region output format: {'<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, x2, y2, x3, y3, x4, y4], …], 'labels': ['text1', …]}}

prompt = "<OCR_WITH_REGION>"
run_example(prompt)
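Unlike the other tasks, OCR with region returns quadrilaterals: eight numbers per region, read as four (x, y) corner points. Many drawing and cropping utilities expect an axis-aligned [x1, y1, x2, y2] box instead. The sketch below converts one form into the other. The helper names and the sample values are illustrative only.

```python
def quad_to_bbox(quad):
    """Smallest axis-aligned [x1, y1, x2, y2] box enclosing a quad box."""
    if len(quad) != 8:
        raise ValueError("a quad box needs 8 numbers: x1, y1, ..., x4, y4")
    xs, ys = quad[0::2], quad[1::2]  # even positions are x, odd positions are y
    return [min(xs), min(ys), max(xs), max(ys)]


def ocr_regions(result, task="<OCR_WITH_REGION>"):
    """Pair each recognized text with the enclosing box of its quad."""
    payload = result[task]
    return [(text, quad_to_bbox(quad))
            for text, quad in zip(payload["labels"], payload["quad_boxes"])]


# Illustrative result in the documented format (values are made up).
sample = {"<OCR_WITH_REGION>": {
    "quad_boxes": [[10, 20, 110, 25, 108, 60, 8, 55]],
    "labels": ["EXIT"],
}}

print(ocr_regions(sample))
```

The enclosing box discards the rotation information of the quad, so keep the original `quad_boxes` when the text orientation matters.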

For more detailed examples, please refer to the notebook.

Benchmarks

Florence-2 Zero-shot performance

The following table presents the zero-shot performance of generalist vision foundation models on image captioning and object detection evaluation tasks. These models have not been exposed to the training data of the evaluation tasks during their training phase.

Method	#params	COCO Cap. test CIDEr	NoCaps val CIDEr	TextCaps val CIDEr	COCO Det. val2017 mAP
Flamingo	80B	84.3	-	-	-
Florence-2-base	0.23B	133.0	118.7	70.1	34.7
Florence-2-large	0.77B	135.6	120.8	72.8	37.5

The following table continues the comparison with performance on other vision-language evaluation tasks.

Method	Flickr30k test R@1	RefCOCO val Accuracy	RefCOCO test-A Accuracy	RefCOCO test-B Accuracy	RefCOCO+ val Accuracy	RefCOCO+ test-A Accuracy	RefCOCO+ test-B Accuracy	RefCOCOg val Accuracy	RefCOCOg test Accuracy	RefCOCO RES val mIoU
Kosmos-2	78.7	52.3	57.4	47.3	45.5	50.7	42.2	60.6	61.7	-
Florence-2-base	83.6	53.9	58.4	49.7	51.5	56.4	47.9	66.3	65.1	34.6
Florence-2-large	84.4	56.3	61.6	51.4	53.6	57.9	49.9	68.0	67.0	35.8

Florence-2 finetuned performance

We finetune Florence-2 models on a collection of downstream tasks, resulting in two generalist models, Florence-2-base-ft and Florence-2-large-ft, that can handle a wide range of downstream tasks.

The table below compares the performance of specialist and generalist models on various captioning and Visual Question Answering (VQA) tasks. Specialist models are fine-tuned specifically for each task, whereas generalist models are fine-tuned in a task-agnostic manner across all tasks. The symbol “▲” indicates the usage of external OCR as input.

Method	# Params	COCO Caption Karpathy test CIDEr	NoCaps val CIDEr	TextCaps val CIDEr	VQAv2 test-dev Acc	TextVQA test-dev Acc	VizWiz VQA test-dev Acc
Specialist Models
CoCa	2.1B	143.6	122.4	-	82.3	-	-
BLIP-2	7.8B	144.5	121.6	-	82.2	-	-
GIT2	5.1B	145.0	126.9	148.6	81.7	67.3	71.0
Flamingo	80B	138.1	-	-	82.0	54.1	65.7
PaLI	17B	149.1	127.0	160.0▲	84.3	58.8 / 73.1▲	71.6 / 74.4▲
PaLI-X	55B	149.2	126.3	147.0 / 163.7▲	86.0	71.4 / 80.8▲	70.9 / 74.6▲
Generalist Models
Unified-IO	2.9B	-	100.0	-	77.9	-	57.4
Florence-2-base-ft	0.23B	140.0	116.7	143.9	79.7	63.6	63.6
Florence-2-large-ft	0.77B	143.3	124.9	151.1	81.7	73.5	72.6

Method	# Params	COCO Det. val2017 mAP	Flickr30k test R@1	RefCOCO val Accuracy	RefCOCO test-A Accuracy	RefCOCO test-B Accuracy	RefCOCO+ val Accuracy	RefCOCO+ test-A Accuracy	RefCOCO+ test-B Accuracy	RefCOCOg val Accuracy	RefCOCOg test Accuracy	RefCOCO RES val mIoU
Specialist Models
SeqTR	-	-	-	83.7	86.5	81.2	71.5	76.3	64.9	74.9	74.2	-
PolyFormer	-	-	-	90.4	92.9	87.2	85.0	89.8	78.0	85.8	85.9	76.9
UNINEXT	0.74B	60.6	-	92.6	94.3	91.5	85.2	89.6	79.8	88.7	89.4	-
Ferret	13B	-	-	89.5	92.4	84.4	82.8	88.1	75.2	85.8	86.3	-
Generalist Models
UniTAB	-	-	-	88.6	91.1	83.8	81.0	85.4	71.6	84.6	84.7	-
Florence-2-base-ft	0.23B	41.4	84.0	92.6	94.8	91.5	86.8	91.7	82.2	89.8	82.2	78.0
Florence-2-large-ft	0.77B	43.4	85.2	93.4	95.3	92.0	88.3	92.9	83.6	91.2	91.7	80.5

BibTeX and citation info

@article{xiao2023florence,
  title={Florence-2: Advancing a unified representation for a variety of vision tasks},
  author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
  journal={arXiv preprint arXiv:2311.06242},
  year={2023}
}

Pricing of florence-2-base replicate.com

Run time and cost

This model costs approximately $0.0051 per run on Replicate, or about 196 runs per $1, though the cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 9 seconds, though the predict time varies significantly with the inputs.
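The two pricing figures are consistent with each other: 1 / 0.0051 is roughly 196.08, which rounds down to 196 whole runs per dollar. A small sketch of that arithmetic, and of a budget estimate built on it, follows. It assumes the approximate per-run price quoted above stays constant, which the page itself says is not guaranteed.

```python
import math

COST_PER_RUN_USD = 0.0051  # approximate per-run price quoted on this page


def runs_per_dollar(cost_per_run=COST_PER_RUN_USD):
    """Whole runs that fit into one dollar at a fixed per-run price."""
    return math.floor(1 / cost_per_run)


def estimate_cost(num_runs, cost_per_run=COST_PER_RUN_USD):
    """Rough total in USD for num_runs predictions at a fixed per-run price."""
    return num_runs * cost_per_run


print(runs_per_dollar())
print(round(estimate_cost(1000), 2))
```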

Runs of lucataco florence-2-base on replicate.com

Total runs: 31.1K

30-day runs: 1000

More Information About florence-2-base replicate.com Model

florence-2-base license:

https://huggingface.co/microsoft/Florence-2-base/resolve/main/LICENSE

florence-2-base replicate.com

The lucataco florence-2-base model on replicate.com hosts Florence-2-base (Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks) so that it can be used immediately, without local setup. replicate.com offers a free online trial of the model as well as paid use, and the model can be called through the API from Node.js, Python, or HTTP.

florence-2-base replicate.com Url

https://replicate.com/lucataco/florence-2-base

florence-2-base install

florence-2-base is an open source model available on GitHub, and any user can install it from there free of charge. It is also hosted on replicate.com, where it can be tried and debugged without a local installation, including through the API.

florence-2-base install url in replicate.com:

https://replicate.com/lucataco/florence-2-base

florence-2-base install url in github:

https://github.com/lucataco/cog-florence-2-base
