SHI-Labs / OLA-VLM
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024
☆49 · Updated 3 weeks ago
Alternatives and similar repositories for OLA-VLM:
Users that are interested in OLA-VLM are comparing it to the libraries listed below
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆40 · Updated last month
- Official PyTorch Implementation of Self-emerging Token Labeling ☆32 · Updated 10 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆19 · Updated 2 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆41 · Updated 6 months ago
- 🤖 [ICLR'25] Multimodal Video Understanding Framework (MVU) ☆27 · Updated 2 weeks ago
- Implementation of the model "MC-ViT" from the paper "Memory Consolidation Enables Long-Context Video Understanding" ☆19 · Updated 3 weeks ago
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆128 · Updated 2 months ago
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆18 · Updated 3 months ago
- [AAAI 2025] ChatterBox: Multi-round Multimodal Referring and Grounding ☆50 · Updated 2 months ago
- ☆29 · Updated 3 weeks ago
- ☆41 · Updated last month
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆17 · Updated last month
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents" ☆41 · Updated last week
- The official repo of continuous speculative decoding ☆24 · Updated 2 months ago
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity" ☆23 · Updated 4 months ago
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching" ☆35 · Updated 6 months ago
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆67 · Updated 2 months ago
- Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" ☆95 · Updated 7 months ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration ☆22 · Updated 4 months ago
- Video-LLaVA fine-tune for CinePile evaluation ☆46 · Updated 6 months ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges ☆61 · Updated 5 months ago
- [WACV 2025] Official implementation of "Online-LoRA: Task-free Online Continual Learning via Low Rank Adaptation" by Xiwen Wei, Guihong L… ☆31 · Updated 3 months ago
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings" ☆120 · Updated 5 months ago
- This repository provides an improved LLamaGen Model, fine-tuned on 500,000 high-quality images, each accompanied by over 300 token prompt… ☆30 · Updated 3 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆35 · Updated 8 months ago
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆41 · Updated last month
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆59 · Updated 7 months ago
- [ECCV 2024] Parrot Captions Teach CLIP to Spot Text ☆63 · Updated 5 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo… ☆29 · Updated 4 months ago
- PyTorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 · Updated 10 months ago