magic-research / Sa2VA
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
★ 1,201 · Updated this week
Alternatives and similar repositories for Sa2VA
Users interested in Sa2VA are comparing it to the repositories listed below.
- [ECCV 2024] The official code of paper "Open-Vocabulary SAM". ★ 986 · Updated last year
- [ICCV 2025] SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree ★ 500 · Updated this week
- OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24] ★ 1,314 · Updated 2 months ago
- [ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ★ 577 · Updated last year
- [CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM" ★ 248 · Updated last month
- [CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception ★ 578 · Updated last year
- Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs ★ 841 · Updated 3 months ago
- [CVPR 2025] The First Investigation of CoT Reasoning (RL, TTS, Reflection) in Image Generation ★ 772 · Updated 2 months ago
- [CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale ★ 1,141 · Updated 9 months ago
- [NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions ★ 1,073 · Updated 9 months ago
- [CVPR 2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning" ★ 826 · Updated 3 months ago
- Liquid: Language Models are Scalable and Unified Multi-modal Generators ★ 607 · Updated 3 months ago
- Official PyTorch implementation of "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM" ★ 1,047 · Updated 2 months ago
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ★ 318 · Updated 3 weeks ago
- NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing ★ 558 · Updated 9 months ago
- Real-time and accurate open-vocabulary end-to-end object detection ★ 1,334 · Updated 7 months ago
- Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model" ★ 437 · Updated 4 months ago
- Ola: Pushing the Frontiers of Omni-Modal Language Model ★ 352 · Updated last month
- [CVPR'25 Highlight] You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale ★ 678 · Updated 3 months ago
- [ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning ★ 1,289 · Updated last month
- [ECCV 2024] Tokenize Anything via Prompting ★ 587 · Updated 7 months ago
- [ICLR 2025] Repository for the Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation. ★ 1,625 · Updated this week
- A family of versatile and state-of-the-art video tokenizers. ★ 406 · Updated 3 months ago
- Video-Inpaint-Anything: This is the inference code for our paper CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, C… ★ 299 · Updated 10 months ago
- Project Page for "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement" ★ 474 · Updated last month
- [ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists" ★ 462 · Updated last year
- Official implementation of OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion ★ 349 · Updated 4 months ago
- [CVPR'23] Universal Instance Perception as Object Discovery and Retrieval ★ 1,275 · Updated 2 years ago
- [CVPR'25] Tora: Trajectory-oriented Diffusion Transformer for Video Generation ★ 1,193 · Updated 3 weeks ago
- [ICCV 2025] Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement ★ 591 · Updated last month