If you like our project, please give us a star ⭐ on GitHub for the latest updates.
📰 News
[2024.01.27] 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperforms a dense model with 7B parameters.
[2024.01.17] 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024!
[2024.01.16] 🔥🔥🔥 We have reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh (an example invocation follows this list).
[2023.11.30] 🤝 Thanks to the generous contributions of the community, the OpenXLab demo is now accessible.
[2023.11.23] We are training a new and powerful model.
[2023.11.21] 🤝 Check out the replicate demo, created by @nateraw, who has generously supported our research!
[2023.11.20] 🤗 The Hugging Face demo and all code & datasets are now available! You are welcome to watch 👀 this repository for the latest updates.
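For the LoRA fine-tuning mentioned above, a minimal launch sketch: the scripts/v1_5/ location is an assumption based on the LLaVA-style repository layout, so locate finetune_lora.sh in your checkout and adjust the data and checkpoint paths inside it before running.

```bash
# Hypothetical invocation: the script path below is assumed from the LLaVA-style repo layout.
# Edit the data, image/video folder, and output paths inside finetune_lora.sh first.
bash scripts/v1_5/finetune_lora.sh
```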
😮 Highlights
Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.
💡 Simple baseline, learning united visual representation by alignment before projection
By binding unified visual representations to the language feature space, we enable an LLM to perform visual reasoning on both images and videos simultaneously.
🔥 High performance, complementary learning with video and image
Extensive experiments demonstrate the complementarity of the two modalities, showing significant gains over models designed specifically for either images or videos.
🤗 Demo
Gradio Web UI
We highly recommend trying our web demo with the following command, which incorporates all features currently supported by Video-LLaVA. We also provide an online demo on Hugging Face Spaces.
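A minimal launch sketch, assuming the repository and its dependencies are installed locally; the module path follows the LLaVA-style package layout this codebase builds on, so verify it against your checkout:

```bash
# Launch the local Gradio web demo (module path assumed from the package layout).
python -m videollava.serve.gradio_web_server
```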
👍 Acknowledgement
LLaVA: the codebase we build upon, an efficient large language and vision assistant.
Video-ChatGPT: great job contributing the evaluation code and dataset.
🙌 Related Projects
LanguageBind: an open-source, language-based retrieval framework spanning five modalities.
Chat-UniVi: a framework that empowers the model to efficiently utilize a limited number of visual tokens.
🔒 License
The majority of this project is released under the Apache 2.0 license, as found in the LICENSE file.
The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, the Terms of Use of the data generated by OpenAI, and the Privacy Practices of ShareGPT. Please contact us if you find any potential violation.
✏️ Citation
If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil:.
@article{lin2023video,
title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
journal={arXiv preprint arXiv:2311.10122},
year={2023}
}
@article{zhu2023languagebind,
title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
author={Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and Cui, Jiaxi and Wang, HongFa and Pang, Yatian and Jiang, Wenhao and Zhang, Junwu and Li, Zongwei and others},
journal={arXiv preprint arXiv:2310.01852},
year={2023}
}
✨ Star History
🤝 Contributors