zhangzjn / EMOv2
[T-PAMI 2025] EMOv2: Pushing 5M Vision Model Frontier
☆46 Updated 7 months ago
Alternatives and similar repositories for EMOv2
Users who are interested in EMOv2 are comparing it to the libraries listed below.
- Scaling Vision Pre-Training to 4K Resolution ☆195 Updated last week
- Official implementation of Add-SD: Rational Generation without Manual Reference. ☆27 Updated 11 months ago
- ☆90 Updated 5 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆54 Updated 3 weeks ago
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025] ☆82 Updated 2 weeks ago
- [ICCV2025] Harnessing CLIP, DINO and SAM for Open Vocabulary Segmentation ☆72 Updated last month
- PyTorch Implementation of "SMITE: Segment Me In TimE" (ICLR 2025) ☆211 Updated 4 months ago
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆20 Updated 9 months ago
- ☆53 Updated 3 months ago
- ☆46 Updated 2 months ago
- Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding ☆198 Updated 6 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 Updated 10 months ago
- An open-source implementation for fine-tuning SmolVLM. ☆42 Updated 3 months ago
- Official PyTorch implementation of "No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding" ☆31 Updated last year
- [CVPR 2025] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception ☆70 Updated 2 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆28 Updated last month
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 Updated last year
- [CVPR 2025] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction ☆126 Updated 4 months ago
- We propose FlexEdit, an end-to-end image editing method that leverages both free-shape masks and language instructions for Flexible Editi… ☆32 Updated 11 months ago
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆68 Updated 6 months ago
- Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion ☆43 Updated 5 months ago
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆60 Updated 5 months ago
- CAVIS: Context-Aware Video Instance Segmentation ☆88 Updated last week
- [ICCV2025] Referring any person or objects given a natural language description. Code base for RexSeek and HumanRef Benchmark ☆149 Updated 3 months ago
- (ICCV 2025) ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations ☆61 Updated this week
- An open source implementation of CLIP (With TULIP Support) ☆162 Updated 2 months ago
- Vision Manus: Your versatile Visual AI assistant ☆245 Updated last week
- Official PyTorch implementation of "Vision Transformers Don't Need Trained Registers" ☆83 Updated last month
- A Simple Framework of Small-scale LMMs for Video Understanding ☆73 Updated 2 months ago
- Structured Video Comprehension of Real-World Shorts ☆152 Updated this week