SHI-Labs / OLA-VLM
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024
☆57 · updated 3 weeks ago
Alternatives and similar repositories for OLA-VLM:
Users interested in OLA-VLM are comparing it to the repositories listed below.
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆43 · updated 2 months ago
- Official repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆65 · updated 3 weeks ago
- [ICLR 2025] Source code for the paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆68 · updated 3 months ago
- [CVPR 2025] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆155 · updated this week
- [AAAI 2025] ChatterBox: Multi-round Multimodal Referring and Grounding ☆53 · updated 3 months ago
- ☆42 · updated 2 months ago
- [Under Review] Official PyTorch implementation code for realizing the technical part of Phantom of Latent representing equipped with enla… ☆52 · updated 5 months ago
- 🤖 [ICLR 2025] Multimodal Video Understanding Framework (MVU) ☆29 · updated last month
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆42 · updated 7 months ago
- Matryoshka Multimodal Models ☆98 · updated last month
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity" ☆25 · updated 5 months ago
- [NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context ☆149 · updated 5 months ago
- Implementation of MC-ViT from the paper "Memory Consolidation Enables Long-Context Video Understanding" ☆21 · updated last month
- Reproduction of LLaVA-v1.5 based on the Llama-3-8b LLM backbone ☆64 · updated 4 months ago
- ☆66 · updated 2 months ago
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆199 · updated 2 months ago
- Official repo for ByteVideoLLM/Dynamic-VLM ☆20 · updated 3 months ago
- ✨✨ Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆154 · updated 2 months ago
- Official PyTorch implementation of Self-emerging Token Labeling ☆32 · updated 11 months ago
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆59 · updated 8 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 · updated 5 months ago
- [TMLR] Public code repo for the paper "A Single Transformer for Scalable Vision-Language Modeling" ☆130 · updated 4 months ago
- Official repo for StableLLAVA ☆94 · updated last year
- MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment ☆33 · updated 8 months ago
- [CVPR 2025] PAR: Parallelized Autoregressive Visual Generation. https://epiphqny.github.io/PAR-project/ ☆126 · updated 2 months ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆96 · updated 3 weeks ago
- PyTorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 · updated last year
- ☆49 · updated 3 months ago