zehanwang01 / OmniBind
☆33 · Updated 7 months ago
Alternatives and similar repositories for OmniBind
Users interested in OmniBind are also comparing it to the repositories listed below.
- [CVPR 2024] Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities ☆100 · Updated last year
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression ☆61 · Updated 9 months ago
- Official implementation of MIA-DPO ☆67 · Updated 9 months ago
- [ICCV 2025] Explore the Limits of Omni-modal Pretraining at Scale ☆118 · Updated last year
- Code for "CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning" ☆25 · Updated 7 months ago
- [ACL 2024 Oral] Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback ☆76 · Updated last year
- [arXiv:2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation ☆93 · Updated 8 months ago
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs ☆33 · Updated last month
- [ICCV 2025] Dynamic-VLM ☆26 · Updated 11 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆61 · Updated 4 months ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark ☆131 · Updated 5 months ago
- [ICCV 2025 Oral] Official implementation of Learning Streaming Video Representation via Multitask Training ☆66 · Updated last month
- Official implementation of Next Block Prediction: Video Generation via Semi-Autoregressive Modeling ☆39 · Updated 9 months ago
- ☆78 · Updated 4 months ago
- Evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" ☆30 · Updated 10 months ago
- [NeurIPS 2024] Official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆66 · Updated last year
- [ECCV 2024 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model ☆17 · Updated 9 months ago
- [CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection ☆127 · Updated 3 months ago
- [CVPR 2025] Official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models" ☆194 · Updated 5 months ago
- [ICLR 2025] MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models ☆89 · Updated last year
- ☆43 · Updated last year
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆65 · Updated last year
- [CVPR 2025] Code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" ☆20 · Updated 8 months ago
- [ICCV 2025] Official repository of the paper "ViSpeak: Visual Instruction Feedback in Streaming Videos" ☆40 · Updated 4 months ago
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? ☆77 · Updated 4 months ago
- [NeurIPS 2025] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation ☆71 · Updated 2 months ago
- Repo for the paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs" ☆48 · Updated 2 months ago
- Official repo for StableLLAVA ☆94 · Updated last year
- [CVPR 2024] ViT-Lens: Towards Omni-modal Representations ☆183 · Updated 9 months ago
- On Path to Multimodal Generalist: General-Level and General-Bench ☆19 · Updated 4 months ago