lucataco / qwen2-vl-7b-instruct

Latest model in the Qwen family for chatting about images and videos

replicate.com
Total runs: 12.6K
24-hour runs: 0
7-day runs: 0
30-day runs: 0
Github
Model's Last Updated: December 20, 2024


Readme

Qwen2-VL-7B-Instruct

Introduction

We’re excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

What’s New in Qwen2-VL?
Key Enhancements:
  • SoTA understanding of images at various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

  • Understanding videos of 20 minutes or longer: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.

  • Agent that can operate your mobile phone, robot, etc.: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices like mobile phones and robots for automatic operation based on the visual environment and text instructions.

  • Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in other languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

Model Architecture Updates:
  • Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens and offering a more human-like visual processing experience.

  • Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.

We provide three models, with 2, 7, and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our Blog and GitHub.

Requirements

The code for Qwen2-VL is available in the latest Hugging Face transformers. We advise you to build from source with the command pip install git+https://github.com/huggingface/transformers, or you might encounter an error complaining that the qwen2_vl model type is not recognized (KeyError: 'qwen2_vl').
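
As a quick sanity check (not part of the original card), you can verify that the installed transformers build actually ships the Qwen2-VL classes:

    # Older transformers releases do not include Qwen2-VL; this import fails there
    # and succeeds on a build (for example, one installed from source) that does.
    try:
        from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
        print("transformers build includes Qwen2-VL support")
    except ImportError:
        print("Qwen2-VL classes not found; install transformers from source")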

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently, including base64 data, URLs, and interleaved images and videos. It is published as qwen-vl-utils and can be installed with pip install qwen-vl-utils.
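
Below is a minimal sketch of the usual transformers + qwen-vl-utils flow; the image URL is a placeholder and the generation settings are illustrative rather than prescriptive.

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    # Load the instruction-tuned 7B checkpoint; device_map="auto" spreads it over available GPUs.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

    # A single-turn conversation mixing one image (placeholder URL) with a text question.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://example.com/demo.jpeg"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]

    # Build the chat prompt and gather the visual inputs referenced by the messages.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    # Generate and decode only the newly produced tokens.
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])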

More Usage Tips

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
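
For example, a video turn looks much like the image case, except that the video entry must point at a local file. This is a sketch: the path is a placeholder, and the fps and max_pixels keys are optional sampling/resolution controls supported by qwen-vl-utils.

    # Videos are read from local files only; URLs and base64 are not supported here.
    video_messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "file:///path/to/video.mp4",  # placeholder local path
                    "max_pixels": 360 * 420,               # optional per-frame resolution cap
                    "fps": 1.0,                            # optional frame-sampling rate
                },
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ]
    # Feed video_messages through the same processor / process_vision_info flow shown above.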

Image Resolution for performance boost

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
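
As a sketch of that configuration, the 256-1280 token range corresponds to pixel bounds on the processor, since each visual token covers a 28x28 patch:

    from transformers import AutoProcessor

    # A budget of 256-1280 visual tokens per image, expressed in pixels (28x28 per token).
    min_pixels = 256 * 28 * 28
    max_pixels = 1280 * 28 * 28
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
    )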

In addition, we provide two methods for fine-grained control over the image size passed to the model:

  1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.

  2. Specify exact dimensions: Directly set resized_height and resized_width. These values will be rounded to the nearest multiple of 28.
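
Both methods are set per image inside the message content; here is a sketch with placeholder local paths:

    # Method 1: bound the pixel budget for this image while preserving its aspect ratio.
    image_by_pixels = {
        "type": "image",
        "image": "file:///path/to/your/image.jpg",
        "min_pixels": 50176,    # 64 * 28 * 28
        "max_pixels": 100352,   # 128 * 28 * 28
    }

    # Method 2: request exact dimensions; both values are rounded to multiples of 28.
    image_by_size = {
        "type": "image",
        "image": "file:///path/to/your/image.jpg",
        "resized_height": 280,
        "resized_width": 420,
    }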

Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:

  1. Lack of Audio Support: The current model does not comprehend audio information within videos.
  2. Data timeliness: Our image dataset is updated through June 2023, and information subsequent to this date may not be covered.
  3. Constraints in Individuals and Intellectual Property (IP): The model’s capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
  4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model’s understanding and execution capabilities require enhancement.
  5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
  6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model’s inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model’s performance and scope of application.

Citation

If you find our work helpful, feel free to cite us.

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}

Runs of lucataco qwen2-vl-7b-instruct on replicate.com

Total runs: 12.6K
24-hour runs: 0
3-day runs: 0
7-day runs: 0
30-day runs: 0

More Information About qwen2-vl-7b-instruct replicate.com Model

qwen2-vl-7b-instruct replicate.com

qwen2-vl-7b-instruct is an AI model hosted on replicate.com that provides the model's capabilities (the latest model in the Qwen family for chatting about images and videos) and can be used instantly. replicate.com supports a free trial of qwen2-vl-7b-instruct as well as paid usage, and the model can be called through an API from Node.js, Python, or plain HTTP.
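
For example, a call from Python with the official replicate client might look like the sketch below; the input field names ("media", "prompt") are assumptions, so check the model's input schema on its Replicate page before use.

    import replicate

    # Requires the REPLICATE_API_TOKEN environment variable to be set.
    # The input keys below are hypothetical; consult
    # https://replicate.com/lucataco/qwen2-vl-7b-instruct for the actual schema.
    output = replicate.run(
        "lucataco/qwen2-vl-7b-instruct",
        input={
            "media": "https://example.com/demo.jpg",
            "prompt": "Describe this image.",
        },
    )
    print(output)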

qwen2-vl-7b-instruct replicate.com Url

https://replicate.com/lucataco/qwen2-vl-7b-instruct

lucataco qwen2-vl-7b-instruct online free

replicate.com is an online trial and API platform that integrates qwen2-vl-7b-instruct, including API services, and provides a free online trial of the model. You can try qwen2-vl-7b-instruct online for free via the link below.

lucataco qwen2-vl-7b-instruct online free url in replicate.com:

https://replicate.com/lucataco/qwen2-vl-7b-instruct

qwen2-vl-7b-instruct install

qwen2-vl-7b-instruct is an open-source model whose code is published on GitHub, where any user can find and install it for free. At the same time, replicate.com provides a hosted deployment of qwen2-vl-7b-instruct, so users can try and debug the model directly on replicate.com, or call it through the API without a local installation.

qwen2-vl-7b-instruct install url in replicate.com:

https://replicate.com/lucataco/qwen2-vl-7b-instruct

qwen2-vl-7b-instruct install url in github:

https://github.com/lucataco/cog-qwen2-vl-7b-instruct

Other APIs from lucataco

replicate

Remove background from an image

Total runs: 5.3M
Run Growth: 100.0K
Growth Rate: 1.89%
Updated: 2023年9月15日
replicate

Falcons.ai Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification

Total runs: 4.5M
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月21日
replicate

Implementation of Realistic Vision v5.1 with VAE

Total runs: 3.7M
Run Growth: 400.0K
Growth Rate: 10.81%
Updated: 2023年8月15日
replicate

FLUX.1-Dev LoRA Explorer

Total runs: 2.9M
Run Growth: 300.0K
Growth Rate: 10.34%
Updated: 2024年10月5日
replicate

SDXL ControlNet - Canny

Total runs: 2.1M
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年10月4日
replicate

SDXL Inpainting by the HF Diffusers team

Total runs: 1.8M
Run Growth: 200.0K
Growth Rate: 11.11%
Updated: 2024年3月5日
replicate

Turn any image into a video

Total runs: 1.3M
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年9月2日
replicate

Segmind Stable Diffusion Model (SSD-1B) is a distilled 50% smaller version of SDXL, offering a 60% speedup while maintaining high-quality text-to-image generation capabilities

Total runs: 992.2K
Run Growth: 600
Growth Rate: 0.06%
Updated: 2023年11月8日
replicate

Hyper FLUX 8-step by ByteDance

Total runs: 926.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年8月28日
replicate

CLIP Interrogator for SDXL optimizes text prompts to match a given image

Total runs: 845.9K
Run Growth: 300
Growth Rate: 0.04%
Updated: 2024年5月16日
replicate

FLUX.1-Dev Multi LoRA Explorer

Total runs: 827.2K
Run Growth: 144.0K
Growth Rate: 17.52%
Updated: 2024年10月6日
replicate

A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.

Total runs: 796.6K
Run Growth: 1.4K
Growth Rate: 0.18%
Updated: 2023年10月15日
replicate

Robust face restoration algorithm for old photos/AI-generated faces

Total runs: 768.8K
Run Growth: 149.5K
Growth Rate: 19.45%
Updated: 2023年9月6日
replicate

FLUX.1-Schnell LoRA Explorer

Total runs: 702.6K
Run Growth: 97.5K
Growth Rate: 13.97%
Updated: 2024年9月7日
replicate

Coqui XTTS-v2: Multilingual Text To Speech Voice Cloning

Total runs: 644.5K
Run Growth: 46.6K
Growth Rate: 7.23%
Updated: 2023年11月28日
replicate

SDXL v1.0 - A text-to-image generative AI model that creates beautiful images

Total runs: 477.4K
Run Growth: 100
Growth Rate: 0.02%
Updated: 2023年11月1日
replicate

😊 Hotshot-XL is an AI text-to-GIF model trained to work alongside Stable Diffusion XL

Total runs: 464.2K
Run Growth: 37.0K
Growth Rate: 7.97%
Updated: 2023年10月23日
replicate

snowflake-arctic-embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance

Total runs: 397.2K
Run Growth: 300
Growth Rate: 0.08%
Updated: 2024年4月19日
replicate

Latent Consistency Model (LCM): SDXL, distills the original model into a version that requires fewer steps (4 to 8 instead of the original 25 to 50)

Total runs: 394.2K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月12日
replicate

Monster Labs QrCode ControlNet on top of SD Realistic Vision v5.1

Total runs: 387.3K
Run Growth: 1.5K
Growth Rate: 0.39%
Updated: 2023年9月24日
replicate

moondream2 is a small vision language model designed to run efficiently on edge devices

Total runs: 291.3K
Run Growth: 48.3K
Growth Rate: 16.63%
Updated: 2024年7月29日
replicate

RealvisXL-v2.0 with LCM LoRA - requires fewer steps (4 to 8 instead of the original 40 to 50)

Total runs: 290.9K
Run Growth: 200
Growth Rate: 0.07%
Updated: 2023年11月15日
replicate

Implementation of SDXL RealVisXL_V2.0

Total runs: 283.6K
Run Growth: 400
Growth Rate: 0.14%
Updated: 2023年11月9日
replicate

Animate Your Personalized Text-to-Image Diffusion Models

Total runs: 281.9K
Run Growth: 2.7K
Growth Rate: 0.96%
Updated: 2023年9月24日
replicate

Practical face restoration algorithm for *old photos* or *AI-generated faces* (for larger images)

Total runs: 234.2K
Run Growth: 8.4K
Growth Rate: 3.59%
Updated: 2023年8月2日
replicate

DreamShaper is a general purpose SD model that aims at doing everything well, photos, art, anime, manga. It's designed to match Midjourney and DALL-E.

Total runs: 194.5K
Run Growth: 3.8K
Growth Rate: 1.95%
Updated: 2023年12月19日
replicate

Real-ESRGAN Video Upscaler

Total runs: 137.7K
Run Growth: 19.3K
Growth Rate: 14.02%
Updated: 2023年11月24日
replicate

A unique fusion that showcases exceptional prompt adherence and semantic understanding, it seems to be a step above base SDXL and a step closer to DALLE-3 in terms of prompt comprehension

Total runs: 124.3K
Run Growth: 600
Growth Rate: 0.48%
Updated: 2023年12月27日
replicate

CLIP Interrogator (for faster inference)

Total runs: 122.0K
Run Growth: 400
Growth Rate: 0.33%
Updated: 2023年9月12日
replicate

dreamshaper-xl-lightning is a Stable Diffusion model that has been fine-tuned on SDXL

Total runs: 107.0K
Run Growth: 6.3K
Growth Rate: 5.89%
Updated: 2024年2月27日
replicate

Phi-3-Mini-4K-Instruct is a 3.8B parameters, lightweight, state-of-the-art open model trained with the Phi-3 datasets

Total runs: 81.7K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年7月2日
replicate

SDXL_Niji_Special Edition

Total runs: 64.6K
Run Growth: 2.2K
Growth Rate: 3.41%
Updated: 2023年11月13日
replicate

PixArt-Alpha 1024px is a transformer-based text-to-image diffusion system trained on text embeddings from T5

Total runs: 64.0K
Run Growth: 11.4K
Growth Rate: 17.81%
Updated: 2023年12月4日
replicate

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Total runs: 55.5K
Run Growth: 200
Growth Rate: 0.36%
Updated: 2023年12月5日
replicate

Dreamshaper-7 img2img with LCM LoRA for faster inference

Total runs: 55.1K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月17日
replicate

AI-driven audio enhancement for your audio files, powered by Resemble AI

Total runs: 52.3K
Run Growth: 9.2K
Growth Rate: 17.59%
Updated: 2023年12月15日
replicate

Ostris AI-Toolkit for Flux LoRA Training (DEPRECATED. Please use: ostris/flux-dev-lora-trainer)

Total runs: 51.2K
Run Growth: 3.1K
Growth Rate: 6.08%
Updated: 2024年8月18日
replicate

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Total runs: 49.5K
Run Growth: 3.4K
Growth Rate: 6.87%
Updated: 2024年6月25日
replicate

Implementation of SDXL RealVisXL_V1.0

Total runs: 44.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年9月13日
replicate

SDXL Image Blending

Total runs: 42.4K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年12月11日
replicate

(Academic and Non-commercial use only) Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization

Total runs: 39.5K
Run Growth: 300
Growth Rate: 0.76%
Updated: 2024年1月8日
replicate

BakLLaVA-1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture

Total runs: 39.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年10月24日
replicate

lmsys/vicuna-13b-v1.3

Total runs: 38.4K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年6月29日
replicate

Mistral-7B-v0.1 fine tuned for chat with the Dolphin dataset (an open-source implementation of Microsoft's Orca)

Total runs: 35.0K
Run Growth: 400
Growth Rate: 1.14%
Updated: 2023年10月31日
replicate

Real-ESRGAN with optional face correction and adjustable upscale (for larger images)

Total runs: 34.3K
Run Growth: 200
Growth Rate: 0.58%
Updated: 2023年7月17日
replicate

Gemma2 2b by Google

Total runs: 33.1K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年8月1日
replicate

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate SDXL images with an image prompt

Total runs: 31.1K
Run Growth: 200
Growth Rate: 0.64%
Updated: 2023年11月11日
replicate

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Total runs: 31.1K
Run Growth: 1000
Growth Rate: 3.22%
Updated: 2024年6月25日
replicate

Realistic Vision v5.0 with VAE

Total runs: 29.9K
Run Growth: 1.4K
Growth Rate: 4.68%
Updated: 2023年8月18日
replicate

lmsys/vicuna-7b-v1.3

Total runs: 28.6K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年6月29日
replicate

(Research only) IP-Adapter-FaceID can generate various style images conditioned on a face with only text prompts

Total runs: 28.3K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年12月20日
replicate

Meta's Llama 2 7b Chat - GPTQ

Total runs: 20.3K
Run Growth: 100
Growth Rate: 0.49%
Updated: 2023年7月24日
replicate

sdxs-512-0.9 can generate high-resolution images in real-time based on prompt texts, trained using score distillation and feature matching

Total runs: 18.8K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年3月28日
replicate

Stylized Audio-Driven Single Image Talking Face Animation

Total runs: 18.6K
Run Growth: 100
Growth Rate: 0.54%
Updated: 2023年10月8日
replicate

Meta's Llama 2 13b Chat - GPTQ

Total runs: 18.5K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年7月24日
replicate

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Total runs: 17.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年1月23日
replicate

ThinkDiffusionXL is a go-to model capable of amazing photorealism that's also versatile enough to generate high-quality images across a variety of styles and subjects without needing to be a prompting genius

Total runs: 15.5K
Run Growth: 100
Growth Rate: 0.65%
Updated: 2023年11月6日
replicate

This is wizard-vicuna-13b trained with a subset of the dataset - responses that contained alignment / moralizing were removed

Total runs: 15.1K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年4月25日
replicate

Hyper FLUX 16-step by ByteDance

Total runs: 15.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年8月28日
replicate

Mistral-7B-v0.1 fine tuned for chat with the Dolphin dataset (an open-source implementation of Microsoft's Orca)

Total runs: 13.4K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年10月31日
replicate

Image-to-video - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Total runs: 12.0K
Run Growth: 600
Growth Rate: 5.00%
Updated: 2023年11月23日
replicate

InterpAny-Clearer: Clearer anytime frame interpolation & Manipulated interpolation

Total runs: 11.4K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月29日
replicate

Segments an audio recording based on who is speaking (on A100)

Total runs: 11.4K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年7月21日
replicate

(Research only) Moondream1 is a vision language model that performs on par with models twice its size

Total runs: 10.7K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年1月24日
replicate

Image to Image enhancer using DemoFusion

Total runs: 10.4K
Run Growth: 100
Growth Rate: 0.96%
Updated: 2023年12月8日
replicate

Open diffusion model for high-quality video generation

Total runs: 10.3K
Run Growth: 100
Growth Rate: 0.97%
Updated: 2023年10月19日
replicate

Auto fuse a user's face onto the template image, with a similar appearance to the user

Total runs: 10.1K
Run Growth: 200
Growth Rate: 1.98%
Updated: 2023年11月15日
replicate

Segment Anything 2 (SAM2) by Meta - Automatic mask generation

Total runs: 9.0K
Run Growth: 1.2K
Growth Rate: 13.33%
Updated: 2024年7月31日
replicate

DemoFusion: Democratising High-Resolution Image Generation With No 💰

Total runs: 9.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年12月3日
replicate

Implementation of SDXL RealVisXL_V2.0 img2img

Total runs: 8.6K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月7日
replicate

Phi-3-Mini-128K-Instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets

Total runs: 8.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年4月25日
replicate

Cog wrapper for Ollama llama3:70b

Total runs: 6.6K
Run Growth: 6.5K
Growth Rate: 98.42%
Updated: 2024年7月9日
replicate

360 Panorama SDXL image with inpainted wrapping seam

Total runs: 6.3K
Run Growth: 100
Growth Rate: 1.59%
Updated: 2023年9月9日
replicate

Convert your videos to DensePose and use it with MagicAnimate

Total runs: 5.7K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年12月6日
replicate

Projection module trained to add vision capabilties to Llama 3 using SigLIP

Total runs: 5.5K
Run Growth: 100
Growth Rate: 1.82%
Updated: 2024年11月5日
replicate

Fuyu-8B is a multi-modal text and image transformer trained by Adept AI

Total runs: 4.6K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年10月20日
replicate

Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data

Total runs: 4.1K
Run Growth: 100
Growth Rate: 2.44%
Updated: 2024年2月6日
replicate

Controlnet v1.1 - Tile Version

Total runs: 4.0K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月27日
replicate

SDXL using DeepCache

Total runs: 3.8K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年1月7日
replicate

Playground v2 is a diffusion-based text-to-image generative model trained from scratch. Try out all 3 models here

Total runs: 3.6K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年12月7日
replicate

nomic-embed-text-v1 is 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks

Total runs: 3.6K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2024年2月12日
replicate

Segmind Stable Diffusion Model (SSD-1B) img2img

Total runs: 3.6K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月3日
replicate

A combination of ip_adapter SDv1.5 and mediapipe-face to inpaint a face

Total runs: 3.6K
Run Growth: 200
Growth Rate: 5.56%
Updated: 2023年11月15日
replicate

Phi-2 by Microsoft

Total runs: 3.5K
Run Growth: 200
Growth Rate: 5.71%
Updated: 2024年1月30日
replicate

Implementation of SDXL RealVisXL_V1.0 img2img

Total runs: 3.4K
Run Growth: 0
Growth Rate: 0.00%
Updated: 2023年11月1日
replicate

A Flux LoRA trained on watercolor style photos

Total runs: 3.2K
Run Growth: 500
Growth Rate: 15.63%
Updated: 2024年8月15日