baaivision / DIVA
[ICLR 2025] Diffusion Feedback Helps CLIP See Better
☆299 · Updated last year
Alternatives and similar repositories for DIVA
Users interested in DIVA are comparing it to the repositories listed below.
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions ☆248 · Updated last year
- [CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation". ☆433 · Updated 6 months ago
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆159 · Updated last year
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆367 · Updated 6 months ago
- [NeurIPS 2024] Dense Connector for MLLMs ☆180 · Updated last year
- [CVPR 2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆204 · Updated 7 months ago
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆336 · Updated last year
- Official repository for the paper MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (https://arxiv.org/abs/2406.17770). ☆159 · Updated last year
- [COLM 2025] Official implementation of the Law of Vision Representation in MLLMs ☆176 · Updated 4 months ago
- Implements VAR+CLIP for text-to-image (T2I) generation ☆147 · Updated last year
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆173 · Updated last year
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era" ☆211 · Updated last year
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing" ☆307 · Updated 4 months ago
- [ECCV 2024] Official PyTorch implementation of DreamLIP: Language-Image Pre-training with Long Captions ☆137 · Updated 9 months ago
- [NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training ☆227 · Updated 10 months ago
- My implementation of "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆268 · Updated 3 weeks ago
- [CVPR 2024] PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. ☆252 · Updated last year
- [CVPR 2024] The repository provides code for running inference and training for "Segment and Caption Anything" (SCA), links for downloadin… ☆231 · Updated last year
- [ICLR 2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆246 · Updated last year
- Official implementation of UnifiedReward & [NeurIPS 2025] UnifiedReward-Think & UnifiedReward-Flex ☆699 · Updated this week
- When do we not need larger vision models? ☆412 · Updated last year
- [NeurIPS 2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning ☆256 · Updated 3 months ago
- ☆160 · Updated last year
- ☆360 · Updated 2 years ago
- Code for MetaMorph: Multimodal Understanding and Generation via Instruction Tuning ☆234 · Updated 2 weeks ago
- [NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding ☆507 · Updated 2 months ago
- Densely Captioned Images (DCI) dataset repository. ☆195 · Updated last year
- [ICLR 2025] Reconstructive Visual Instruction Tuning ☆135 · Updated 10 months ago
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation ☆236 · Updated 5 months ago
- [CVPR 2025] FLAIR: VLM with Fine-grained Language-informed Image Representations ☆132 · Updated 5 months ago