songweii / DualToken

🌈 Unifying Visual Understanding and Generation with Dual Visual Vocabularies

☆23

Alternatives and similar repositories for DualToken:

Users who are interested in DualToken are comparing it to the repositories listed below.

Haochen-Wang409 / ross
[ICLR 2025] Reconstructive Visual Instruction Tuning
☆73 Updated 3 weeks ago
Cooperx521 / PyramidDrop
(CVPR 2025) PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
☆81 Updated 3 weeks ago
joez17 / VideoNIAH
VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs
☆46 Updated 2 weeks ago
Share14 / ShareGemini
☆29 Updated 7 months ago
OpenGVLab / V2PE
[ArXiv] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
☆31 Updated 3 months ago
RupertLuo / VoCoT
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
☆49 Updated 8 months ago
jiyt17 / IDA-VLM
[ICLR 2025] IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
☆27 Updated 4 months ago
hmxiong / StreamChat
Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" ICLR2025
☆40 Updated 2 weeks ago
yu-rp / VisualPerceptionToken
☆53 Updated this week
OpenGVLab / Mono-InternVL
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
☆22 Updated 2 weeks ago
foundation-multimodal-models / CAL
[NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
☆57 Updated 6 months ago
saccharomycetes / mllms_know
[ICLR'25] Official code for the paper 'MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs'
☆89 Updated last week
Liuziyu77 / MIA-DPO
Official implementation of MIA-DPO
☆54 Updated 2 months ago
yaolinli / DeCo
☆31 Updated 8 months ago
SCZwangxiao / video-ReTaKe
Official implementation of paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
☆29 Updated last week
showlab / MovieSeq
[ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences
☆36 Updated 2 weeks ago
wdrink / OpenTokenizer
☆18 Updated 2 months ago
palchenli / VL-Instruction-Tuning
☆91 Updated last year
yuecao0119 / MMInstruct
[SCIS 2024] The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Di…
☆48 Updated 4 months ago
ywh187 / FitPrune
☆40 Updated 2 months ago
longvideobench / LongVideoBench
[NeurIPS'24 D&B] Official Dataloader and Evaluation Scripts for LongVideoBench.
☆90 Updated 8 months ago
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆86 Updated 2 months ago
Hon-Wong / ByteVideoLLM
This is the official repo for ByteVideoLLM/Dynamic-VLM
☆20 Updated 3 months ago
baaivision / DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
☆137 Updated 3 months ago
TencentARC / GVT
Official code for "What Makes for Good Visual Tokenizers for Large Language Models?".
☆58 Updated last year
NVlabs / QLIP
[arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
☆61 Updated 3 weeks ago
hshjerry / VideoEspresso
[CVPR'25] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
☆63 Updated last month
AILab-CVC / VL-GPT
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
☆86 Updated 6 months ago
OpenGVLab / PVC
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
☆34 Updated last month
pengts / VW-LMM
☆24 Updated 10 months ago