magic-research / Sa2VA

🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

☆1,249

Alternatives and similar repositories for Sa2VA

Users who are interested in Sa2VA are comparing it to the repositories listed below.

HarborYuan / ovsam
[ECCV 2024] The official code of paper "Open-Vocabulary SAM".
☆1,002 Updated last month
lxtGH / OMG-Seg
OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24]
☆1,321 Updated 3 months ago
Mark12Ding / SAM2Long
[ICCV 2025] SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
☆511 Updated last month
FoundationVision / Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
☆578 Updated last year
DAMO-NLP-SG / VideoRefer
[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
☆265 Updated last week
NVlabs / EAGLE
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
☆865 Updated last month
FoundationVision / GLEE
[CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
☆1,151 Updated 10 months ago
ZiyuGuo99 / Image-Generation-CoT
[CVPR 2025] The First Investigation of CoT Reasoning (RL, TTS, Reflection) in Image Generation
☆795 Updated 3 months ago
shenyunhang / APE
[CVPR 2024] Aligning and Prompting Everything All at Once for Universal Visual Perception
☆588 Updated last year
ShareGPT4Omni / ShareGPT4Video
[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"
☆1,075 Updated 11 months ago
CircleRadon / Osprey
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
☆829 Updated 3 weeks ago
chongzhou96 / EdgeSAM
Official PyTorch implementation of "EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM"
☆1,058 Updated 3 months ago
FoundationVision / Liquid
Liquid: Language Models are Scalable and Unified Multi-modal Generators
☆613 Updated 5 months ago
Oryx-mllm / Oryx
[ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
☆324 Updated 2 months ago
showlab / Show-o
[ICLR 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.
☆1,696 Updated this week
om-ai-lab / OmDet
Real-time and accurate open-vocabulary end-to-end object detection
☆1,335 Updated 8 months ago
hustvl / EVF-SAM
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
☆465 Updated 5 months ago
Ola-Omni / Ola
Ola: Pushing the Frontiers of Omni-Modal Language Model
☆365 Updated 3 months ago
MasterBin-IIAU / UNINEXT
[CVPR'23] Universal Instance Perception as Object Discovery and Retrieval
☆1,278 Updated 2 years ago
NJU-PCALab / RAG-Diffusion
[ICCV 2025] Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement 🔥
☆601 Updated 2 months ago
NVlabs / describe-anything
[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
☆1,323 Updated 2 months ago
XueZeyue / DanceGRPO
An official implementation of DanceGRPO: Unleashing GRPO on Visual Generation
☆774 Updated this week
SkyworkAI / Matrix-Game
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
☆1,606 Updated 3 weeks ago
microsoft / VidTok
a family of versatile and state-of-the-art video tokenizers.
☆412 Updated 2 weeks ago
baaivision / See3D
[CVPR'25 Highlight] You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
☆685 Updated 5 months ago
alibaba / Tora
[CVPR'25] Tora: Trajectory-oriented Diffusion Transformer for Video Generation
☆1,198 Updated 2 months ago
yoxu515 / aot-benchmark
An efficient modular implementation of Associating Objects with Transformers for Video Object Segmentation in PyTorch
☆572 Updated last year
Vchitect / Vchitect-2.0
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
☆914 Updated 5 months ago
qqlu / Entity
EntitySeg Toolbox: Towards Open-World and High-Quality Image Segmentation
☆1,032 Updated last year
AlaaLab / InstructCV
[ ICLR 2024 ] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
☆463 Updated last year