ByteDance-Seed / SAILLinks

Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"

☆73

Alternatives and similar repositories for SAIL

Users that are interested in SAIL are comparing it to the libraries listed below

Sorting:

TencentARC / SEED-Bench-R1
☆94Updated 5 months ago
NVlabs / QLIP
[arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
☆94Updated 9 months ago
RenShuhuai-Andy / NBP
Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
☆40Updated 9 months ago
MengLcool / DeepStack-VL
[NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…
☆72Updated last year
TencentARC / GRPO-CARE
☆79Updated 5 months ago
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆62Updated 4 months ago
OpenGVLab / PVC
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
☆49Updated 5 months ago
Fr0zenCrane / UniCoT
Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
☆173Updated last week
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆133Updated 6 months ago
TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆78Updated 4 months ago
TencentARC / MindOmni
☆135Updated last month
alibaba / conv-llava
☆123Updated last year
OpenGVLab / MMIU
[ICLR2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
☆90Updated last year
lxtGH / DenseWorld-1M
Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World"
☆116Updated 2 months ago
baaivision / DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
☆158Updated 11 months ago
om-ai-lab / ZoomEye
[EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
☆62Updated 2 weeks ago
rongyaofang / PUMA
Empowering Unified MLLM with Multi-granular Visual Generation
☆131Updated 10 months ago
inclusionAI / Ming-UniVision
Code release for Ming-UniVision: Joint Image Understanding and Geneation with a Continuous Unified Tokenizer
☆126Updated last month
multimodal-reasoning-lab / Bagel-Zebra-CoT
https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
☆103Updated last month
inst-it / inst-it
[NeurIPS 2025] The official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tun…
☆38Updated 9 months ago
Haochen-Wang409 / ross
[ICLR'25] Reconstructive Visual Instruction Tuning
☆128Updated 7 months ago
yfzhang114 / SliME
✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
☆163Updated 11 months ago
Hon-Wong / ByteVideoLLM
[ICCV 2025] Dynamic-VLM
☆26Updated 11 months ago
kxfan2002 / SophiaVL-R1
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
☆87Updated 3 months ago
Gen-Verse / HermesFlow
[NeurIPS 2025] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
☆72Updated 2 months ago
OpenGVLab / Mono-InternVL
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
☆96Updated 4 months ago
EvolvingLMMs-Lab / OpenMMReasoner
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
☆111Updated this week
AILab-CVC / VL-GPT
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
☆86Updated last year
facebookresearch / multimodal_rewardbench
Multimodal RewardBench
☆55Updated 9 months ago
yeezhu / UNIT
PyTorch implementation of "UNIT: Unifying Image and Text Recognition in One Vision Encoder", NeurlPS 2024.
☆31Updated last year