OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆40 Updated last month
Alternatives and similar repositories for TPO:
Users who are interested in TPO are comparing it to the repositories listed below.
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆19 Updated 2 months ago
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ☆41 Updated 6 months ago
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression ☆51 Updated last week
- [ICLR2025] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆65 Updated 3 weeks ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆35 Updated 8 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions" ☆45 Updated 4 months ago
- Code for "AVG-LLaVA: A Multimodal Large Model with Adaptive Visual Granularity" ☆23 Updated 4 months ago
- Official Repository of Personalized Visual Instruct Tuning ☆26 Updated 3 months ago
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding ☆17 Updated last month
- Code Release of F-LMM: Grounding Frozen Large Multimodal Models ☆62 Updated 6 months ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆63 Updated 5 months ago
- [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM ☆67 Updated 3 months ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark ☆75 Updated 3 weeks ago
- [AAAI2025] ChatterBox: Multi-round Multimodal Referring and Grounding, Multimodal, Multi-round dialogues ☆50 Updated 2 months ago
- VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆95 Updated 7 months ago
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)" ☆32 Updated 8 months ago
- [ICLR 2025] Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr… ☆67 Updated 2 months ago
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings" ☆120 Updated 5 months ago
- ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback ☆59 Updated 5 months ago
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection ☆50 Updated last month
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation, arXiv 2024 ☆49 Updated 3 weeks ago
- The official code of the paper "PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction". ☆52 Updated last month
- ☆56 Updated 9 months ago
- Official repo for StableLLAVA ☆94 Updated last year
- Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model ☆41 Updated last month
- ✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models ☆150 Updated last month
- Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning" ☆26 Updated last week
- ☆26 Updated 6 months ago
- Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters". ☆44 Updated last month