[2024/01/19] We open source ViSFT, including training scripts and weights. Evaluation code will be released soon.
Introduction
Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have tried to introduce region-level visual learning into CLIP's pretraining, but they face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing, such as instruction tuning, we explore the potential of fine-grained SFT to enhance the generalization of vision foundation models after their pretraining. We therefore propose a two-stage method, ViSFT (Vision SFT), to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on some in-domain tasks and is then tested on out-of-domain benchmarks. After being updated with ViSFT on 8 V100 GPUs for less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks, covering both vision and vision-linguistic scenarios.
# can be executed in parallel
bash ./scripts/stage1_train/eva_e/caption.sh
bash ./scripts/stage1_train/eva_e/detection.sh
bash ./scripts/stage1_train/eva_e/segment.sh
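Since the three stage-1 scripts above can run in parallel, one way to launch them together is to start each in the background and wait for all of them. The following is a minimal sketch, not part of the ViSFT repository: the helper name `run_parallel` is hypothetical, and it assumes the machine has enough GPU capacity for all three jobs at once and that each script manages its own device selection.

```shell
# Hypothetical helper (not part of the ViSFT repo): run each given script
# in the background with bash, then wait for every job to finish.
# Returns 0 only if all scripts exited successfully.
run_parallel() {
  pids=""
  status=0
  for script in "$@"; do
    bash "$script" &
    pids="$pids $!"
  done
  for pid in $pids; do
    # Collect each job's exit status; remember any failure.
    wait "$pid" || status=1
  done
  return "$status"
}

# Example invocation with the stage-1 script paths listed above:
# run_parallel \
#   ./scripts/stage1_train/eva_e/caption.sh \
#   ./scripts/stage1_train/eva_e/detection.sh \
#   ./scripts/stage1_train/eva_e/segment.sh
```

Waiting on each process ID individually, rather than calling a bare `wait`, lets the helper report a failure if any one of the training jobs exits with a non-zero status.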
The code of ViSFT is based on the official implementations of mmf, EVA, and LAVIS.
Citation
If you find our work valuable, please cite:
@misc{jiang2024supervised,
title={Supervised Fine-tuning in turn Improves Visual Foundation Models},
author={Xiaohu Jiang and Yixiao Ge and Yuying Ge and Chun Yuan and Ying Shan},
year={2024},
eprint={2401.10222},
archivePrefix={arXiv},
primaryClass={cs.CV}
}