magic-research / Sa2VA
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
★1,167 · Updated last week
Alternatives and similar repositories for Sa2VA
Users who are interested in Sa2VA are comparing it to the repositories listed below.
- [ICCV 2025] SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree ★490 · Updated this week
- [ECCV 2024] The official code of paper "Open-Vocabulary SAM". ★979 · Updated 11 months ago
- OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24] ★1,305 · Updated last month
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ★572 · Updated last year
- [CVPR 2025] The First Investigation of CoT Reasoning (RL, TTS, Reflection) in Image Generation ★763 · Updated last month
- [CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception ★574 · Updated last year
- [CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM" ★235 · Updated 3 weeks ago
- [CVPR2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale ★1,140 · Updated 8 months ago
- [NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions ★1,067 · Updated 9 months ago
- Liquid: Language Models are Scalable and Unified Multi-modal Generators ★600 · Updated 3 months ago
- Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs ★821 · Updated 2 months ago
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ★312 · Updated last week
- [CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning" ★822 · Updated 2 months ago
- Ola: Pushing the Frontiers of Omni-Modal Language Model ★347 · Updated last month
- Real-time and accurate open-vocabulary end-to-end object detection ★1,330 · Updated 6 months ago
- [CVPR'25 Highlight] You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale ★677 · Updated 2 months ago
- A family of versatile and state-of-the-art video tokenizers. ★403 · Updated 3 months ago
- [ICLR 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation. ★1,568 · Updated last week
- [ICCV 2025] Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement 🔥 ★585 · Updated 2 weeks ago
- [CVPR'23] Universal Instance Perception as Object Discovery and Retrieval ★1,274 · Updated last year
- Official PyTorch implementation of "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM" ★1,038 · Updated last month
- Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model" ★429 · Updated 3 months ago
- [CVPR'25] Tora: Trajectory-oriented Diffusion Transformer for Video Generation ★1,172 · Updated this week
- [ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists" ★462 · Updated last year
- Official repository of T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT ★350 · Updated 3 weeks ago
- NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing ★554 · Updated 8 months ago
- Video-Inpaint-Anything: This is the inference code for our paper CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, C… ★298 · Updated 9 months ago
- Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple te… ★1,087 · Updated 5 months ago
- An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch ★572 · Updated last year
- Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement" ★452 · Updated last month