Haochen-Wang409 / Grasp-Any-Region
Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs".
☆92Updated last week
Alternatives and similar repositories for Grasp-Any-Region
Users who are interested in Grasp-Any-Region are comparing it to the repositories listed below:
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆61Updated 3 months ago
- NEO Series: Native Vision-Language Models from First Principles☆223Updated 3 weeks ago
- Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."☆133Updated 3 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆36Updated 5 months ago
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible☆107Updated 3 months ago
- ☆62Updated 4 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model☆42Updated last year
- Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"☆116Updated this week
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"☆140Updated last week
- Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"☆70Updated 2 weeks ago
- [NeurIPS 2025] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation, arXiv 2024☆64Updated last month
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning☆51Updated 3 months ago
- https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT☆98Updated 2 weeks ago
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement☆118Updated 3 months ago
- ☆56Updated 6 months ago
- ☆94Updated 4 months ago
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision☆167Updated last week
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration☆24Updated last year
- Official Implementation of LaViDa: A Large Diffusion Language Model for Multimodal Understanding☆170Updated 3 weeks ago
- The official repository of "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Integration"☆121Updated 2 months ago
- Official implementation of Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (NeurIPS 2025)☆43Updated last month
- The SAIL-VL2 series model developed by the BytedanceDouyinContent Group☆75Updated last month
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆93Updated 8 months ago
- ☆132Updated last month
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆77Updated 4 months ago
- ☆39Updated 5 months ago
- Video-LLaVA fine-tune for CinePile evaluation☆51Updated last year
- Quick Long Video Understanding☆68Updated 2 weeks ago
- PhysGame Benchmark for Physical Commonsense Evaluation in Gameplay Videos☆46Updated 4 months ago
- Structured Video Comprehension of Real-World Shorts☆215Updated last month