zhangzjn / EMOv2
EMOv2: Pushing 5M Vision Model Frontier
☆46 · Updated 6 months ago
Alternatives and similar repositories for EMOv2
Users who are interested in EMOv2 are comparing it to the repositories listed below.
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆51 · Updated 6 months ago
- ☆85 · Updated 4 months ago
- We propose FlexEdit, an end-to-end image editing method that leverages both free-shape masks and language instructions for Flexible Editi… ☆33 · Updated 10 months ago
- [ICCV 2025] Harnessing CLIP, DINO and SAM for Open Vocabulary Segmentation ☆63 · Updated 3 weeks ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 · Updated 9 months ago
- ☆45 · Updated 2 months ago
- Scaling Vision Pre-Training to 4K Resolution ☆190 · Updated last month
- ☆53 · Updated 2 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 · Updated 11 months ago
- Official implementation of Add-SD: Rational Generation without Manual Reference. ☆27 · Updated 10 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆27 · Updated last month
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆60 · Updated 4 months ago
- [CVPR 2025] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception ☆66 · Updated last month
- Project for "LaSagnA: Language-based Segmentation Assistant for Complex Queries". ☆57 · Updated last year
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model ☆105 · Updated last week
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆67 · Updated 6 months ago
- Make Your Training Flexible: Towards Deployment-Efficient Video Models ☆30 · Updated last month
- Official PyTorch implementation of "No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding" ☆33 · Updated last year
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆20 · Updated 8 months ago
- Code for the paper "Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers" [ICCV 2025] ☆73 · Updated 3 weeks ago
- [ICCV 2025] Official implementation of LLaVA-KD: A Framework of Distilling Multimodal Large Language Models ☆87 · Updated 2 weeks ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration ☆24 · Updated 9 months ago
- ☆22 · Updated 3 months ago
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines ☆125 · Updated 8 months ago
- [IJCV 2024] MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation ☆123 · Updated 9 months ago
- [IEEE TCSVT] Official PyTorch implementation of CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation. ☆43 · Updated 6 months ago
- FaceXBench: Evaluating Multimodal LLMs on Face Understanding ☆14 · Updated 5 months ago
- This repo is the official implementation of iSeg: An Iterative Refinement-based Framework for Training-free Segmentation. ☆37 · Updated 7 months ago
- The official implementation of "VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning" ☆226 · Updated this week
- Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding ☆194 · Updated 5 months ago