magic-research / Sa2VA
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
⭐ 1,103 · Updated last week
Alternatives and similar repositories for Sa2VA
Users interested in Sa2VA are comparing it to the libraries listed below.
- [ECCV 2024] The official code of the paper "Open-Vocabulary SAM". ⭐ 970 · Updated 10 months ago
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree ⭐ 475 · Updated 3 weeks ago
- OMG-LLaVA and OMG-Seg codebase [CVPR 2024 and NeurIPS 2024] ⭐ 1,293 · Updated 5 months ago
- [CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception ⭐ 566 · Updated last year
- [ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ⭐ 565 · Updated 11 months ago
- Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs ⭐ 779 · Updated last month
- [NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions ⭐ 1,057 · Updated 7 months ago
- Liquid: Language Models are Scalable and Unified Multi-modal Generators ⭐ 587 · Updated last month
- Real-time and accurate open-vocabulary end-to-end object detection ⭐ 1,319 · Updated 5 months ago
- [CVPR 2025] The First Investigation of CoT Reasoning (RL, TTS, Reflection) in Image Generation ⭐ 699 · Updated last week
- [CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale ⭐ 1,128 · Updated 7 months ago
- [CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM" ⭐ 204 · Updated 3 weeks ago
- Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model" ⭐ 411 · Updated 2 months ago
- [CVPR'25 Highlight] You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale ⭐ 660 · Updated last month
- [ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation. ⭐ 1,417 · Updated last month
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ⭐ 308 · Updated 3 months ago
- Official PyTorch implementation of "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM" ⭐ 1,016 · Updated last week
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement 🔥 ⭐ 573 · Updated 4 months ago
- [ECCV 2024] Tokenize Anything via Prompting ⭐ 582 · Updated 5 months ago
- [CVPR'23] Universal Instance Perception as Object Discovery and Retrieval ⭐ 1,271 · Updated last year
- Video-Inpaint-Anything: This is the inference code for our paper CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, C… ⭐ 293 · Updated 8 months ago
- Ola: Pushing the Frontiers of Omni-Modal Language Model ⭐ 337 · Updated 3 months ago
- Project Page for "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement" ⭐ 394 · Updated this week
- [CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses tha… ⭐ 883 · Updated 6 months ago
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models ⭐ 911 · Updated 2 months ago
- Official implementation of OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion ⭐ 325 · Updated 2 months ago
- A family of versatile and state-of-the-art video tokenizers. ⭐ 391 · Updated last month
- [CVPR 2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning" ⭐ 819 · Updated last month
- [ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists" ⭐ 460 · Updated last year
- An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch ⭐ 570 · Updated last year