TencentARC / TokLIP

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

☆234

Alternatives and similar repositories for TokLIP

Users who are interested in TokLIP are comparing it to the libraries listed below.

PKU-YuanGroup / WISE
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
☆168 Updated last month
Tencent / HaploVLM
ICML2025
☆61 Updated 3 months ago
PhoenixZ810 / RISEBench
[NIPS 2025 DB Oral] Official Repository of paper: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
☆120 Updated 2 weeks ago
wusize / Harmon
[ICCV2025] Code Release of Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
☆178 Updated 6 months ago
Haochen-Wang409 / ross
[ICLR'25] Reconstructive Visual Instruction Tuning
☆129 Updated 8 months ago
zhang9302002 / ThinkingWithVideos
The official code of "Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning"
☆68 Updated last month
Cooperx521 / ScaleCap
Official repository of 'ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing'
☆58 Updated 5 months ago
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆133 Updated 6 months ago
ByteVisionLab / TokenFlow
[CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".
☆408 Updated 4 months ago
JoeLeelyf / OVO-Bench
[CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
☆105 Updated 4 months ago
gogoduan / GoT-R1
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
☆100 Updated 6 months ago
Video-Reason / Awesome-Video-Reasoning
This is a collection of recent papers on reasoning in video generation models.
☆66 Updated last week
rongyaofang / PUMA
Empowering Unified MLLM with Multi-granular Visual Generation
☆130 Updated 10 months ago
hmxiong / StreamChat
Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" ICLR2025
☆89 Updated 8 months ago
Cooperx521 / PyramidDrop
(CVPR 2025) PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
☆134 Updated 9 months ago
zhijie-group / Orthus
☆64 Updated 6 months ago
TencentARC / MindOmni
☆135 Updated last month
PKU-YuanGroup / UAE
Official repository for the UAE paper, unified-GRPO, and unified-Bench
☆150 Updated 2 months ago
Wang-Xiaodong1899 / CVPR25-MLLM-Paper-List
🔥CVPR 2025 Multimodal Large Language Models Paper List
☆154 Updated 8 months ago
wusize / OpenUni
☆164 Updated 5 months ago
yunlong10 / Awesome-Video-LMM-Post-Training
🔥🔥🔥 Latest Papers, Codes and Datasets on Video-LMM Post-Training
☆186 Updated 2 weeks ago
SxJyJay / UniToken
[CVPRW 2025] UniToken is an auto-regressive generation model that combines discrete and continuous representations to process visual inpu…
☆97 Updated 7 months ago
Yxxxb / VoCo-LLaMA
[CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
☆195 Updated 5 months ago
ncTimTang / AKS
[CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding
☆145 Updated 3 months ago
hu-zijing / B2-DiffuRL
[CVPR 25] A framework named B^2-DiffuRL for RL-based diffusion model fine-tuning.
☆49 Updated 8 months ago
facebookresearch / metamorph
Code for MetaMorph Multimodal Understanding and Generation via Instruction Tuning
☆222 Updated 7 months ago
Purshow / Awesome-Unified-Multimodal
📖 This is a repository for organizing papers, codes, and other resources related to unified multimodal models.
☆334 Updated last month
csuhan / Tar
[NeurIPS 2025] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
☆188 Updated 2 months ago
sming256 / BOLT
[CVPR2025] BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
☆33 Updated 8 months ago
hshjerry / VideoEspresso
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆129 Updated 4 months ago