OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆35Updated last week
Alternatives and similar repositories for TPO:
Users that are interested in TPO are comparing it to the libraries listed below
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model☆41Updated 5 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM☆18Updated 3 weeks ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆45Updated 2 months ago
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection☆40Updated last week
- ☆46Updated 3 weeks ago
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want☆63Updated 2 months ago
- Diffusion Powers Video Tokenizer for Comprehension and Generation☆38Updated last month
- Official Repository of Personalized Visual Instruct Tuning☆26Updated 2 months ago
- [NeurIPS 2024] Efficient Multi-modal Models via Stage-wise Visual Context Compression☆49Updated 5 months ago
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…☆35Updated 6 months ago
- ECCV2024_Parrot Captions Teach CLIP to Spot Text☆63Updated 4 months ago
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)"☆30Updated 6 months ago
- Official repo for StableLLAVA☆94Updated last year
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*☆31Updated this week
- [NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.☆44Updated 2 months ago
- ☆107Updated 5 months ago
- ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback☆56Updated 3 months ago
- Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction☆34Updated this week
- Source code for paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image …☆62Updated last month
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges☆59Updated 3 months ago
- The official implementation of PAR: Parallelized Autoregressive Visual Generation. https://epiphqny.github.io/PAR-project/☆102Updated last week
- 🔥 Aurora Series: A more efficient multimodal large language model series for video.☆61Updated last month
- Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023)☆54Updated last year
- ☆62Updated last month
- Matryoshka Multimodal Models☆90Updated last month
- [AAAI2025] ChatterBox: Multi-round Multimodal Referring and Grounding, Multimodal, Multi-round dialogues☆50Updated 3 weeks ago
- VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".☆88Updated 6 months ago
- Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".☆43Updated last week
- This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3 ?"☆125Updated 6 months ago
- [NeurIPS 2024] Official implementation of the paper "Interfacing Foundation Models' Embeddings"☆118Updated 4 months ago