Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
We introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations.
Building upon the Divot tokenizer, we present Divot-LLM, which performs video comprehension through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model.
All models, training code and inference code are released!
TODOs
Release the pretrained tokenizer and de-tokenizer of Divot.
Release the pretrained and instruction tuned model of Divot-LLM.
Release inference code of Divot.
Release training and inference code of Divot-LLM.
Release training code of Divot.
Release de-tokenizer adaptation training code.
Introduction
We utilize the diffusion procedure to learn a video tokenizer in a self-supervised manner for unified comprehension and generation, where the spatiotemporal representations serve as the condition of a diffusion model to de-noise video clips. Additionally, the proxy diffusion model functions as a de-tokenizer to decode realistic video clips from the video representations.
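The training signal described above can be pictured with a small NumPy sketch. It assumes a standard DDPM-style noise-prediction objective, which this README does not confirm, and every name in it (add_noise, denoising_loss, toy_tokenizer, toy_denoiser) is a hypothetical stand-in rather than code from the repository.

```python
import numpy as np

def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def denoising_loss(denoiser, tokenizer, clip, alpha_bar_t, rng):
    """MSE between the true noise and the noise predicted under the tokenizer's condition."""
    cond = tokenizer(clip)  # video features used as the condition
    x_t, eps = add_noise(clip, alpha_bar_t, rng)
    eps_pred = denoiser(x_t, alpha_bar_t, cond)
    return float(np.mean((eps_pred - eps) ** 2))

# Hypothetical toy stand-ins; clips have shape (frames, channels, height, width).
def toy_tokenizer(clip):
    # Summarize each frame and channel by its spatial mean.
    return clip.mean(axis=(2, 3))

def toy_denoiser(x_t, alpha_bar_t, cond):
    # Rebuild a clean-clip estimate from the condition, then solve for the noise.
    x0_hat = np.broadcast_to(cond[:, :, None, None], x_t.shape)
    return (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
# A toy clip that is spatially constant, so the spatial mean describes it fully.
clip = np.ones((4, 3, 8, 8)) * rng.standard_normal((4, 3, 1, 1))

informative = denoising_loss(toy_denoiser, toy_tokenizer, clip, 0.5, rng)
uninformative = denoising_loss(
    toy_denoiser, lambda c: np.zeros(c.shape[:2]), clip, 0.5, rng
)
print(informative, uninformative)
```

In this toy setting the loss falls when the condition carries enough information about the clean clip, which is the intuition behind using the denoising objective to train the tokenizer.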
After training the Divot tokenizer, video features from the Divot tokenizer are fed into the LLM to perform next-word prediction for video comprehension, while learnable queries are input into the LLM to model the distributions of Divot features using a Gaussian Mixture Model (GMM) for video generation. During inference, video features are sampled from the predicted GMM distribution to decode videos using the de-tokenizer.
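To make the GMM step concrete, here is a minimal NumPy sketch of scoring and sampling a continuous feature vector under a mixture. It assumes diagonal covariances and that the mixture weights, means, and variances are already available as arrays. How Divot-LLM actually parameterizes and predicts them is not described here, and the function names are hypothetical.

```python
import numpy as np

def gmm_nll(x, weights, means, variances):
    """Negative log-likelihood of x under a diagonal-covariance GMM.

    x: (D,), weights: (K,), means: (K, D), variances: (K, D).
    """
    # Log-density of x under each Gaussian component, shape (K,).
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    log_mix = np.log(weights) + log_comp
    # Log-sum-exp for numerical stability.
    m = log_mix.max()
    return float(-(m + np.log(np.sum(np.exp(log_mix - m)))))

def gmm_sample(weights, means, variances, rng):
    """Pick a component by its weight, then draw from that Gaussian."""
    k = rng.choice(len(weights), p=weights)
    noise = rng.standard_normal(means.shape[1])
    return means[k] + np.sqrt(variances[k]) * noise

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))

feature = gmm_sample(weights, means, variances, rng)
print(feature, gmm_nll(feature, weights, means, variances))
```

In a setup like the one described, a negative log-likelihood of this kind could serve as the training loss on ground-truth features, while the sampler corresponds to the inference step before the de-tokenizer.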
git clone https://github.com/TencentARC/Divot.git
cd Divot
pip install -r requirements.txt
Model Weights
We release the pretrained tokenizer and de-tokenizer, along with the pre-trained and instruction-tuned Divot-LLM. Please download the checkpoints and save them under the folder ./pretrained. For example, ./pretrained/Divot_tokenizer_detokenizer.
Prepare the training data in the format of webdataset.
Run the following script.
sh scripts/train_Divot_pretrain_comp_gen.sh
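For reference, a webdataset shard is a plain tar archive in which files that share a basename form one sample. The sketch below writes such a shard with the Python standard library. The extensions used (mp4, txt) and the helper names are assumptions for illustration. Check the data loader in the training code for the keys it actually expects.

```python
import io
import os
import tarfile
import tempfile

def write_shard(shard_path, samples):
    """Write (key, video_bytes, caption) samples into one tar shard.

    Files sharing the same key (basename) are grouped into one sample.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, video_bytes, caption in samples:
            parts = (("mp4", video_bytes), ("txt", caption.encode("utf-8")))
            for ext, payload in parts:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def list_samples(shard_path):
    """Return {key: [extensions]} to check how files group into samples."""
    groups = {}
    with tarfile.open(shard_path, "r") as tar:
        for member in tar.getmembers():
            key, ext = member.name.split(".", 1)
            groups.setdefault(key, []).append(ext)
    return groups

with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard-000000.tar")
    # Placeholder bytes stand in for real encoded video data.
    write_shard(shard, [("clip0001", b"\x00\x01", "a cat jumps"),
                        ("clip0002", b"\x02\x03", "a dog runs")])
    print(list_samples(shard))
```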
Instruction-tuning
Download the checkpoints of pre-trained Divot tokenizer and Divot-LLM in Divot, and save them under the folder ./pretrained.
Prepare the instruction data in the format of webdataset (for generation) and jsonl (for comprehension, where each line stores a dictionary used to specify the video_path, question, and answer).
Run the following script.
### For video comprehension
sh scripts/train_Divot_sft_comp.sh
### For video generation
sh scripts/train_Divot_sft_gen.sh
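The jsonl file for comprehension can be produced and checked with a few lines of Python. This is a minimal sketch that assumes the three fields named above (video_path, question, answer) are used verbatim as dictionary keys. The helper names and the example path are hypothetical.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("video_path", "question", "answer")

def write_jsonl(path, records):
    """Write one JSON dictionary per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path, required=REQUIRED_KEYS):
    """Load records and fail early if a line lacks a required key."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in required if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            records.append(record)
    return records

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "comprehension.jsonl")
    write_jsonl(path, [{
        "video_path": "videos/example.mp4",  # hypothetical path
        "question": "What is the person doing?",
        "answer": "They are riding a bicycle.",
    }])
    print(read_jsonl(path))
```

Validating the file before launching a training run can catch malformed lines much earlier than the data loader would.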
Inference with your own model
Obtain "pytorch_model.bin" with the following script.
cd train_output/sft_comp/checkpoint-xxxx
python3 zero_to_fp32.py . pytorch_model.bin
Merge your trained LoRA weights with the original LLM using the following script.
python3 src/tools/merge_agent_lora_weight.py
Load your merged model in "mistral7b_merged_xxx" and the corresponding "agent" path.
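As background on the merge step, the standard LoRA formulation folds the low-rank update into the base weight so that no adapter is needed at inference time. The NumPy sketch below illustrates that arithmetic only. It is not the code of src/tools/merge_agent_lora_weight.py, and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W' = W + (alpha / rank) * B @ A.

    weight: (out, in), lora_a: (rank, in), lora_b: (out, rank).
    """
    return weight + (alpha / rank) * (lora_b @ lora_a)

rng = np.random.default_rng(0)
out_dim, in_dim, rank, alpha = 4, 3, 2, 16.0
weight = rng.standard_normal((out_dim, in_dim))
lora_a = rng.standard_normal((rank, in_dim))
lora_b = rng.standard_normal((out_dim, rank))
x = rng.standard_normal(in_dim)

# Output with a separate adapter path versus output with merged weights.
with_adapter = weight @ x + (alpha / rank) * (lora_b @ (lora_a @ x))
merged = merge_lora(weight, lora_a, lora_b, alpha, rank) @ x
print(np.allclose(with_adapter, merged))
```

Because the two paths are algebraically identical, the merged checkpoint can be loaded like an ordinary model.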