We launch **EVA**, a vision-centric foundation model to **E**xplore the limits of **V**isual representation at sc**A**le using only publicly accessible data and academic resources.
**EVA** is a vanilla ViT pre-trained to reconstruct the masked-out, image-text aligned vision features (*i.e.*, CLIP features) conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks.
EVA is the first open-sourced billion-scale vision foundation model that achieves state-of-the-art performance on a broad range of downstream tasks.
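To make the pretext task concrete, here is a minimal, hypothetical sketch of the objective: a student ViT predicts the CLIP vision features at masked patch positions and is supervised only where patches are masked. The model interfaces, the masking scheme, and the cosine-style regression loss are assumptions for illustration; see the pre-training code in this repo for the actual implementation.

```python
import torch
import torch.nn.functional as F

def mim_loss(student_vit, clip_vision_tower, images, mask):
    """Hypothetical sketch of EVA's pretext task: regress masked-out CLIP
    vision features conditioned on the visible image patches.

    student_vit:       the ViT being pre-trained; assumed to return one
                       feature per patch token, shape [B, N, D].
    clip_vision_tower: frozen CLIP image encoder providing the targets,
                       also assumed to return [B, N, D] patch features.
    images:            input batch, [B, 3, H, W].
    mask:              boolean tensor [B, N], True where a patch is masked.
    """
    with torch.no_grad():
        # Targets: image-text aligned vision features from the frozen CLIP encoder.
        target = clip_vision_tower(images)      # [B, N, D]

    # The student only observes the visible patches (masking interface assumed).
    pred = student_vit(images, mask)            # [B, N, D]

    # Supervise only the masked positions, e.g. with a cosine regression loss.
    pred = F.normalize(pred[mask], dim=-1)
    target = F.normalize(target[mask], dim=-1)
    return (1 - (pred * target).sum(dim=-1)).mean()
```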
We provide all pre-trained & fine-tuned EVAs for the community.
The following table summarizes the basic statistics of MIM pre-trained EVA and image classification EVA.
The `eva_psz14to16` model interpolates the kernel size of `patch_embed` from `14x14` to `16x16`. This is useful for object detection, instance segmentation & semantic segmentation, etc. See `interpolate_patch_14to16.py` for implementation details.
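For reference, a rough sketch of this conversion is shown below; `interpolate_patch_14to16.py` in this repo is the authoritative implementation, and the checkpoint key used here (`patch_embed.proj.weight`) is an assumption based on common timm-style ViT checkpoints.

```python
import torch
import torch.nn.functional as F

def interpolate_patch_embed_14to16(state_dict, key="patch_embed.proj.weight"):
    """Resize the patch embedding convolution kernel from 14x14 to 16x16 so
    MIM pre-trained weights can be fine-tuned with a 16x16 patch size.
    This is a sketch; see interpolate_patch_14to16.py for the real script."""
    w = state_dict[key]                               # [embed_dim, 3, 14, 14]
    w = F.interpolate(w.float(), size=(16, 16),
                      mode="bicubic", align_corners=False)
    state_dict[key] = w
    return state_dict

# Example usage (file name is a placeholder):
# ckpt = torch.load("eva_mim_psz14.pt", map_location="cpu")
# ckpt["model"] = interpolate_patch_embed_14to16(ckpt["model"])
```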
For MIM pre-trained EVA and EVA-CLIP, we use the `deepspeed` `fp16` format. IN-1K fine-tuned EVA weights are larger (4GB vs. 2GB) because EMA updates the model in `fp32` format. The weights for other downstream tasks are also stored in `fp32` format.
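If you need to check which precision a downloaded checkpoint uses, or cast it to full precision before fine-tuning, a small sketch along these lines works; the file name and the `"model"` key are placeholders for whatever checkpoint you downloaded.

```python
import torch

# Placeholder file name; point this at the EVA checkpoint you downloaded.
ckpt = torch.load("eva_mim_psz14.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)   # some checkpoints nest weights under "model"

# MIM / EVA-CLIP weights are stored in fp16, fine-tuned (EMA) weights in fp32.
print({v.dtype for v in state_dict.values() if torch.is_tensor(v)})

# Optionally cast everything to fp32 for pipelines that expect full precision.
state_dict = {k: (v.float() if torch.is_tensor(v) else v)
              for k, v in state_dict.items()}
```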
The ImageNet-1K zero-shot classification performance is higher than reported in our paper (78.5 vs. 78.2) because of longer training.
We choose to train a 1.3B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glimpse of the challenges in training very large CLIP models. To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model evaluated via zero-shot classification performance.
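The zero-shot protocol itself follows standard CLIP evaluation; the sketch below illustrates it assuming an open_clip-style interface (`encode_image` / `encode_text`), which may differ in detail from the evaluation code used for EVA-CLIP.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, preprocess, tokenizer, image, class_names):
    """Minimal sketch of CLIP-style zero-shot classification."""
    # One text prompt per class; encode both modalities and L2-normalize.
    prompts = tokenizer([f"a photo of a {c}" for c in class_names])
    image_feat = F.normalize(model.encode_image(preprocess(image).unsqueeze(0)), dim=-1)
    text_feat = F.normalize(model.encode_text(prompts), dim=-1)

    # Cosine similarity between the image and every class prompt;
    # the highest-scoring prompt gives the predicted class.
    logits = image_feat @ text_feat.T
    return class_names[logits.argmax(dim=-1).item()]
```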
We will update the results in our paper soon.
For more details of EVA-CLIP, please refer to Section 2.3.5 of our paper. We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc.
Citation
If you find our work helpful, please star this repo and cite the related articles. Thanks for your support!
@article{EVA,
title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
journal={arXiv preprint arXiv:2211.07636},
year={2022}
}
License
The content of this project itself is licensed under the MIT License.
Contact
For help or issues using EVA, please open a GitHub issue.
We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers and interns. If you are interested in working with us on foundation models, self-supervised learning and multimodal learning, please contact Yue Cao ([email protected]) and Xinlong Wang ([email protected]).