deepglint / RWKV-CLIP
[EMNLP 2024] RWKV-CLIP: A Robust Vision-Language Representation Learner
☆129 · Updated 2 months ago
Alternatives and similar repositories for RWKV-CLIP:
Users interested in RWKV-CLIP are comparing it to the repositories listed below.
- [CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era" ☆199 · Updated 9 months ago
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆154 · Updated 3 months ago
- ☆111 · Updated 7 months ago
- [CVPR 2025] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive… ☆237 · Updated 2 months ago
- My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆224 · Updated 2 months ago
- A Simple Framework of Small-scale Large Multimodal Models for Video Understanding, based on TinyLLaVA_Factory ☆46 · Updated last week
- [NeurIPS 2024] Dense Connector for MLLMs ☆157 · Updated 5 months ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆60 · Updated 4 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆199 · Updated 2 months ago
- VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks. ☆214 · Updated 3 weeks ago
- Explore the Limits of Omni-modal Pretraining at Scale ☆97 · Updated 6 months ago
- [ICLR 2025] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation ☆116 · Updated 2 months ago
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆313 · Updated 3 weeks ago
- ☆87 · Updated 8 months ago
- Implementation of PALI3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger" ☆145 · Updated 2 months ago
- E5-V: Universal Embeddings with Multimodal Large Language Models ☆237 · Updated 3 months ago
- Scaling RWKV-Like Architectures for Diffusion Models ☆126 · Updated 11 months ago
- ☆166 · Updated 8 months ago
- [NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training ☆204 · Updated last week
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆137 · Updated 3 months ago
- [CVPR 2024] CapsFusion: Rethinking Image-Text Data at Scale ☆205 · Updated last year
- Implementation of "ViTAR: Vision Transformer with Any Resolution" in PyTorch ☆32 · Updated 4 months ago
- Lion: Kindling Vision Intelligence within Large Language Models ☆52 · Updated last year
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆149 · Updated 6 months ago
- ☆70 · Updated 4 months ago
- [CVPR 2025] Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" ☆138 · Updated 3 weeks ago
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer ☆370 · Updated last week
- A family of highly capable yet efficient large multimodal models ☆178 · Updated 7 months ago
- Precision Search through Multi-Style Inputs ☆65 · Updated 8 months ago
- VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation ☆86 · Updated 6 months ago
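
Most of the repositories above are built around CLIP-style contrastive vision-language training. As a point of reference, below is a minimal sketch of the symmetric InfoNCE objective such models typically optimize. This is a generic illustration, not code from RWKV-CLIP or any repository listed here; the function name, embedding dimension, and batch size are hypothetical.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Illustrative sketch only; names and defaults are assumptions, not the
    RWKV-CLIP API.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature
    # The matching text for image i sits at column i.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropies.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings (batch of 8, 512-dim, both hypothetical).
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```

Repositories such as Inf-CL differ mainly in how this loss is scaled (e.g., tiling the similarity matrix to break the batch-size memory barrier), not in the objective itself.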