path2generalist / General-Level

On Path to Multimodal Generalist: General-Level and General-Bench

☆19

Alternatives and similar repositories for General-Level

Users who are interested in General-Level are comparing it to the repositories listed below.

TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆77 Updated 4 months ago
JaaackHongggg / WorldSense
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
☆33 Updated last month
TencentARC / GRPO-CARE
☆78 Updated 4 months ago
Cooperx521 / ScaleCap
Official repository of "ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing"
☆57 Updated 4 months ago
inst-it / inst-it
[NeurIPS 2025] The official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tun…
☆37 Updated 8 months ago
Liuziyu77 / MIA-DPO
Official implementation of MIA-DPO
☆67 Updated 9 months ago
inclusionAI / M2-Reasoning
M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
☆46 Updated 4 months ago
Hon-Wong / ByteVideoLLM
[ICCV 2025] Dynamic-VLM
☆26 Updated 11 months ago
zhijie-group / UniCMs
☆39 Updated 6 months ago
TIGER-AI-Lab / VISTA
The code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR2025]
☆20 Updated 8 months ago
markywg / transagent
[NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
☆24 Updated last year
OpenGVLab / PVC
[CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
☆50 Updated 5 months ago
Gen-Verse / HermesFlow
[NeurIPS 2025] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
☆71 Updated 2 months ago
MengLcool / DeepStack-VL
[NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…
☆66 Updated last year
360CVGroup / Inner-Adaptor-Architecture
LMM solved catastrophic forgetting, AAAI2025
☆44 Updated 7 months ago
TencentARC / MindOmni
☆132 Updated last month
HaozheZhao / MENTOR
☆30 Updated 4 months ago
MME-Benchmarks / MME-Unify
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
☆41 Updated 7 months ago
yliu-cs / PiTe
[ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model
☆17 Updated 9 months ago
qishisuren123 / AnyCap
A unified framework for controllable caption generation across images, videos, and audio. Supports multi-modal inputs and customizable ca…
☆52 Updated 3 months ago
Dongping-Chen / ISG
(ICLR 2025 Spotlight) Official code repository for Interleaved Scene Graph.
☆31 Updated 3 months ago
zehanwang01 / OmniBind
☆33 Updated 7 months ago
Vision-CAIR / Infinibench
Official InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
☆18 Updated 2 weeks ago
EvolvingLMMs-Lab / VideoMMMU
☆61 Updated 2 months ago
HumanMLLM / ViSpeak
(ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"
☆40 Updated 4 months ago
TIGER-AI-Lab / QuickVideo
Quick Long Video Understanding
☆69 Updated 3 weeks ago
multimodal-reasoning-lab / Bagel-Zebra-CoT
https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
☆98 Updated 2 weeks ago
EvolvingLMMs-Lab / MGPO
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
☆51 Updated 3 months ago
om-ai-lab / ZoomEye
[EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
☆61 Updated 2 months ago
TencentARC / SEED-Bench-R1
☆94 Updated 4 months ago