hustvl / MaTVLM
☆42 Updated 3 weeks ago
Alternatives and similar repositories for MaTVLM
Users who are interested in MaTVLM are comparing it to the repositories listed below.
- The first decoder-only multimodal state space model ☆91 Updated 2 weeks ago
- ☆59 Updated 2 weeks ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆51 Updated 5 months ago
- ☆11 Updated 6 months ago
- [IJCV 2024] ☆16 Updated 6 months ago
- Project for "LaSagnA: Language-based Segmentation Assistant for Complex Queries". ☆56 Updated last year
- ☆30 Updated 4 months ago
- ☆36 Updated last month
- ☆16 Updated last year
- ☆17 Updated last month
- ☆81 Updated 2 months ago
- Harnessing CLIP, DINO and SAM for Open Vocabulary Segmentation ☆58 Updated 3 months ago
- [CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training ☆46 Updated 2 months ago
- ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO ☆58 Updated last week
- [CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmenta… ☆39 Updated last month
- [CVPR 2025] Official repository of the paper "Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation" ☆92 Updated 2 weeks ago
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity" ☆28 Updated 7 months ago
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation ☆74 Updated 3 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆20 Updated 5 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆27 Updated last month
- This repository provides an improved LLamaGen Model, fine-tuned on 500,000 high-quality images, each accompanied by over 300 token prompt… ☆30 Updated 7 months ago
- [IJCV 2025] MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning ☆68 Updated last week
- [IEEE TCSVT] Official PyTorch Implementation of CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation. ☆43 Updated 5 months ago
- Pixel-Level Reasoning Model trained with RL ☆92 Updated this week
- [CVPR 2025] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception ☆55 Updated 2 weeks ago
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models ☆105 Updated last week
- [CVPR 2025] DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution ☆50 Updated 3 months ago
- [CVPR 2025] DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention ☆166 Updated 3 months ago
- Official repository of "CoMP: Continual Multimodal Pre-training for Vision Foundation Models" ☆26 Updated 2 months ago
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding ☆41 Updated this week