2toinf / IVM
[NeurIPS-2024] The official implementation of "Instruction-Guided Visual Masking"
☆33Updated 5 months ago
Alternatives and similar repositories for IVM:
Users who are interested in IVM are comparing it to the repositories listed below:
- ☆69Updated 4 months ago
- Egocentric Video Understanding Dataset (EVUD)☆29Updated 9 months ago
- ☆48Updated last year
- ☆40Updated 3 months ago
- [NeurIPS2024] Official code for (IMA) Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs☆18Updated 6 months ago
- [ECCV 2024] AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation☆33Updated 7 months ago
- ☆27Updated 3 weeks ago
- ☆46Updated 4 months ago
- ☆75Updated 3 weeks ago
- Can 3D Vision-Language Models Truly Understand Natural Language?☆21Updated last year
- Evaluate Multimodal LLMs as Embodied Agents☆44Updated 2 months ago
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision☆60Updated 9 months ago
- [CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmenta…☆34Updated last week
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models☆31Updated 5 months ago
- [NeurIPS 2024] Official Repository of Multi-Object Hallucination in Vision-Language Models☆28Updated 5 months ago
- [CVPR 2025 (Oral)] Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key☆48Updated 2 weeks ago
- Multimodal RewardBench☆38Updated 2 months ago
- ☆28Updated 3 months ago
- Latent Motion Token as the Bridging Language for Robot Manipulation☆81Updated last month
- VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning☆22Updated last week
- Awesome papers on multimodal LLMs with grounding ability☆17Updated 8 months ago
- Official repo of EmbodiedBench, a comprehensive benchmark designed to evaluate MLLMs as embodied agents.☆109Updated 2 weeks ago
- ☆24Updated 5 months ago
- Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"☆25Updated 7 months ago
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision☆40Updated last month
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆69Updated last month
- Spatial-R1: The first MLLM trained using GRPO for spatial reasoning in videos☆25Updated last week
- IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks☆58Updated 7 months ago
- [ICML 2024] The official implementation of "DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning"☆80Updated 7 months ago
- [CVPR2024] This is the official implementation of MP5☆99Updated 9 months ago