SHI-Labs / OLA-VLM
OLA-VLM: Elevating Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024
☆44 · Updated 3 weeks ago
Alternatives and similar repositories for OLA-VLM:
Users who are interested in OLA-VLM are comparing it to the libraries listed below:
- Official PyTorch Implementation of Self-emerging Token Labeling ☆32 · Updated 9 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆35 · Updated last week
- Multimodal Video Understanding Framework (MVU) ☆26 · Updated 7 months ago
- Implementation of the MC-ViT model from the paper "Memory Consolidation Enables Long-Context Video Understanding" ☆18 · Updated last month
- ☆28 · Updated last month
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆31 · Updated this week
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆28 · Updated 3 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆41 · Updated 5 months ago
- ☆28 · Updated this week
- ☆39 · Updated last month
- A big_vision-inspired repo that implements a generic Auto-Encoder class capable of representation learning and generative modeling. ☆31 · Updated 6 months ago
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents" ☆40 · Updated 8 months ago
- [AAAI2025] ChatterBox: Multi-round Multimodal Referring and Grounding, Multimodal, Multi-round dialogues ☆50 · Updated 3 weeks ago
- ☆62 · Updated last month
- PyTorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 · Updated 9 months ago
- A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models. ☆16 · Updated 2 weeks ago
- [Under Review] Official PyTorch implementation code for realizing the technical part of Phantom of Latent representing equipped with enla… ☆48 · Updated 3 months ago
- Video-LLaVA fine-tune for CinePile evaluation ☆45 · Updated 5 months ago
- Official implementation of "Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models" ☆35 · Updated last year
- MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment ☆31 · Updated 6 months ago
- ☆34 · Updated 11 months ago
- Code and Data for Paper: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data ☆33 · Updated 9 months ago
- Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image … ☆62 · Updated last month
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity" ☆19 · Updated 2 months ago
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching" ☆35 · Updated 4 months ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆63 · Updated 4 months ago
- ☆64 · Updated 6 months ago
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings" ☆118 · Updated 4 months ago
- A Framework for Decoupling and Assessing the Capabilities of VLMs ☆40 · Updated 6 months ago