baaivision / DIVA
[ICLR 2025] Diffusion Feedback Helps CLIP See Better
☆254 · Updated 3 weeks ago
Alternatives and similar repositories for DIVA:
Users who are interested in DIVA are comparing it to the libraries listed below.
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆133 · Updated 2 months ago
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-Language Era" ☆197 · Updated 8 months ago
- [CVPR 2024] The repository provides code for running inference and training for "Segment and Caption Anything" (SCA), links for downloadin… ☆211 · Updated 4 months ago
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆290 · Updated this week
- Official repository for the paper "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" (https://arxiv.org/abs/2406.17770) ☆152 · Updated 4 months ago
- 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation" ☆249 · Updated last month
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions ☆193 · Updated 7 months ago
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆310 · Updated 6 months ago
- Implements VAR+CLIP for text-to-image (T2I) generation ☆119 · Updated 3 weeks ago
- PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding, accepted at CVPR 2024 ☆204 · Updated this week
- [NeurIPS 2024] Dense Connector for MLLMs ☆156 · Updated 4 months ago
- [NeurIPS 2024] Repository for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" ☆141 · Updated 3 weeks ago
- 📖 A repository for organizing papers, code, and other resources related to unified multimodal models ☆364 · Updated 3 weeks ago
- ☆134 · Updated last month
- Official implementation of the Law of Vision Representation in MLLMs ☆149 · Updated 2 months ago
- [NeurIPS 2023 & TPAMI] T2I-CompBench(++) for Compositional Text-to-Image Generation Evaluation ☆233 · Updated 2 weeks ago
- ☆308 · Updated last year
- ☆110 · Updated 6 months ago
- [ECCV 2024] VISA: Reasoning Video Object Segmentation via Large Language Model ☆158 · Updated 6 months ago
- [ECCV 2024] Official PyTorch implementation of DreamLIP: Language-Image Pre-training with Long Captions ☆123 · Updated 2 months ago
- [NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training ☆200 · Updated last month
- When do we not need larger vision models? ☆364 · Updated last week
- ☆121 · Updated 7 months ago
- Official code of SmartEdit [CVPR 2024 Highlight] ☆291 · Updated 7 months ago
- [ICLR 2025] Autoregressive Video Generation without Vector Quantization ☆382 · Updated last week
- [ICLR 2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆226 · Updated 6 months ago
- PyTorch code for the paper "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models" ☆192 · Updated last month
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆144 · Updated 4 months ago
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … ☆227 · Updated last month
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling ☆298 · Updated this week