Haochen-Wang409 / Grasp-Any-RegionLinks
Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs".
☆51Updated this week
Alternatives and similar repositories for Grasp-Any-Region
Users that are interested in Grasp-Any-Region are comparing it to the libraries listed below
Sorting:
- ☆91Updated 4 months ago
- https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT☆87Updated 2 months ago
- NEO Series: Native Vision-Language Models from First Principles☆180Updated this week
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆60Updated 3 months ago
- Pixel-Level Reasoning Model trained with RL [NeuIPS25]☆243Updated last month
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision☆160Updated last month
- Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"☆68Updated last month
- Structured Video Comprehension of Real-World Shorts☆208Updated last month
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning☆122Updated 2 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆34Updated 4 months ago
- ICML 2025 - Impossible Videos☆77Updated 3 months ago
- Official repository for the UAE paper, unified-GRPO, and unified-Bench☆142Updated last month
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025)☆177Updated 2 months ago
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆90Updated 7 months ago
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆75Updated 3 months ago
- Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World"☆112Updated 3 weeks ago
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"☆291Updated 3 weeks ago
- [NeurIPS 2025] The official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tun…☆37Updated 8 months ago
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning"☆87Updated last week
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression☆61Updated 8 months ago
- 🔥🔥🔥 Latest Papers, Codes and Datasets on Video-LMM Post-Training☆129Updated last week
- Official Implementation of LaViDa: :A Large Diffusion Language Model for Multimodal Understanding☆162Updated this week
- ☆53Updated 3 weeks ago
- ☆160Updated 4 months ago
- [NeurlPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆138Updated 10 months ago
- ☆61Updated 3 months ago
- Official implementation of Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning☆173Updated last month
- The code repository of UniRL☆44Updated 4 months ago
- Offical repo for CAT-V - Caption Anything in Video: Object-centric Dense Video Captioning with Spatiotemporal Multimodal Prompting☆56Updated 3 months ago
- ☆40Updated 3 months ago