allenai / SAGELinks
[arXiv 2025] SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
☆61Updated last month
Alternatives and similar repositories for SAGE
Users that are interested in SAGE are comparing it to the libraries listed below
Sorting:
- [ICLR'26] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs☆97Updated 2 weeks ago
- [NeurIPS 2025] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation☆70Updated 3 months ago
- InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models☆84Updated last week
- [MTI-LLM@NeurIPS 2025] Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."☆147Updated 6 months ago
- The official repository of "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Integration"☆136Updated 5 months ago
- Official codes of "Monet: Reasoning in Latent Visual Space Beyond Image and Language"☆125Updated last week
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆64Updated 6 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆39Updated 7 months ago
- ☆63Updated 7 months ago
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement☆129Updated 6 months ago
- Official Repository for paper "HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding"☆57Updated 3 weeks ago
- [EMNLP-2025 Oral] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration☆72Updated 2 months ago
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible☆121Updated 6 months ago
- ☆68Updated 4 months ago
- [ICLR 2026] Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play.☆115Updated last week
- DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models☆169Updated last month
- PhysGame Benchmark for Physical Commonsense Evaluation in Gameplay Videos☆47Updated 7 months ago
- Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"☆129Updated last month
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning"☆21Updated last year
- Official PyTorch implementation of TokenSet.☆127Updated 10 months ago
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision☆72Updated last year
- [ArXiv 2025] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models☆128Updated last month
- ☆517Updated 2 weeks ago
- An open source implementation of CLIP (With TULIP Support)☆165Updated 8 months ago
- NextFlow🚀: Unified Sequential Modeling Activates Multimodal Understanding and Generation☆309Updated last month
- Step3-VL-10B: A compact yet frontier multimodal model achieving SOTA performance at the 10B scale, matching open-source models 10-20x its…☆390Updated 3 weeks ago
- X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains☆50Updated last week
- [CVPR 2025]Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction☆168Updated 10 months ago
- This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).☆278Updated this week
- ☆30Updated 3 weeks ago