ziqipang / LM4VisualEncoding
[ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are Effective Visual Encoder Layers"
☆229 · Updated last year
Alternatives and similar repositories for LM4VisualEncoding:
Users interested in LM4VisualEncoding are comparing it to the repositories listed below.
- ☆304 · Updated 11 months ago
- [NeurIPS 2023] Text data, code and pre-trained models for the paper "Improving CLIP Training with Language Rewrites" ☆264 · Updated last year
- Official implementation of the Law of Vision Representation in MLLMs ☆145 · Updated 2 months ago
- Official implementation of "Interpreting CLIP's Image Representation via Text-Based Decomposition" ☆184 · Updated last month
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … ☆202 · Updated 3 weeks ago
- Official implementation of "Why are Visually-Grounded Language Models Bad at Image Classification?" (NeurIPS 2024) ☆66 · Updated 2 months ago
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models ☆109 · Updated 8 months ago
- Open-source implementation of "Vision Transformers Need Registers" ☆162 · Updated 2 months ago
- A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models ☆123 · Updated last year
- Evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… ☆112 · Updated 6 months ago
- ☆117 · Updated 6 months ago
- [CVPR'24] Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities ☆99 · Updated 10 months ago
- [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models ☆261 · Updated 3 months ago
- The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A su… ☆218 · Updated this week
- [ECCV 2024🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners" ☆134 · Updated 4 months ago
- [NeurIPS 2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models" ☆136 · Updated this week
- Implementation of "VL-Mamba: Exploring State Space Models for Multimodal Learning" ☆79 · Updated 9 months ago
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning ☆68 · Updated 7 months ago
- [NeurIPS 2024] Dense Connector for MLLMs ☆154 · Updated 3 months ago
- 🔥 Official implementation of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation" ☆222 · Updated 2 weeks ago
- When do we not need larger vision models? ☆354 · Updated last month
- Official repo for "VisionZip: Longer is Better but Not Necessary in Vision Language Models" ☆219 · Updated 2 weeks ago
- [CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs ☆137 · Updated 5 months ago
- [CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ☆308 · Updated 6 months ago
- ☆132 · Updated last year
- My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆207 · Updated 2 months ago
- VILA-U: A Unified Foundation Model Integrating Visual Understanding and Generation ☆199 · Updated this week
- ChatBridge, an approach to learning a unified multimodal model to interpret, correlate, and reason about various modalities without rely… ☆49 · Updated last year
- ☆44 · Updated 8 months ago
- [CVPR 2024] ViT-Lens: Towards Omni-modal Representations ☆169 · Updated 6 months ago