VLE (Visual-Language Encoder) is an image-text multimodal understanding model built on pre-trained text and image encoders.
It can be used for multimodal discriminative tasks such as visual question answering and image-text retrieval.
VLE achieves significant improvements in particular on the visual commonsense reasoning (VCR) task, which requires high-level language understanding and reasoning skills.
For more details, see https://github.com/iflytek/VLE.
Online VLE demo on Visual Question Answering: https://huggingface.co/spaces/hfl/VQA_VLE_LLM