mPLUG-Owl3 is a state-of-the-art multi-modal large language model designed to tackle the challenges of long image sequence understanding. We propose Hyper Attention, which speeds up long visual sequence understanding in multimodal large language models sixfold and allows processing of visual sequences that are eight times longer. At the same time, the model maintains excellent performance on single-image, multi-image, and video tasks.
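import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model_path = 'mPLUG/mPLUG-Owl3-7B-240728'
# Assumption: the snippet uses `model` without defining it. Loading the checkpoint's
# custom code through AutoModel with trust_remote_code=True is one way to obtain it.
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.half, trust_remote_code=True)
model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)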
messages = [
    {"role": "user", "content": """<|video|>Describe this video."""},
    {"role": "assistant", "content": ""}
]
videos = ['/nas-mmu-data/examples/car_room.mp4']
MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n items evenly spaced across l (the midpoint of each of n equal bins)
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # frame step: keep one frame per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
g = model.generate(**inputs)
print(g)
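The frame selection inside encode_video can be checked without a video file or a GPU. The following is a minimal sketch of the same arithmetic in plain Python: take one candidate frame per second, then thin the candidates to at most MAX_NUM_FRAMES by uniform sampling. The helper name `select_frame_indices`, its parameters, and the 60-second, 30 fps clip are hypothetical and used only for illustration; the `max(1, ...)` guard against a zero step is an addition that the snippet above does not have.

```python
def uniform_sample(items, n):
    """Pick n items spread evenly across items (midpoint of each of n equal bins)."""
    gap = len(items) / n
    return [items[int(i * gap + gap / 2)] for i in range(n)]


def select_frame_indices(total_frames, avg_fps, max_frames=16):
    """Hypothetical helper: indices kept by the one-frame-per-second strategy."""
    # Guard against a zero step for very low frame rates (not in the original snippet).
    step = max(1, round(avg_fps))
    candidates = list(range(0, total_frames, step))
    if len(candidates) > max_frames:
        candidates = uniform_sample(candidates, max_frames)
    return candidates


# Hypothetical clip: 60 seconds at 30 fps gives 60 candidates, thinned to 16.
indices = select_frame_indices(total_frames=1800, avg_fps=30.0, max_frames=16)
print(len(indices), indices[:3], indices[-1])  # prints: 16 [30, 150, 270] 1740
```

A clip shorter than max_frames seconds keeps every candidate, so a 10-second clip yields 10 frames rather than 16.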
Citation
If you find our work helpful, please consider citing it.
@misc{ye2024mplugowl3longimagesequenceunderstanding,
title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
year={2024},
eprint={2408.04840},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.04840},
}