Hoar012 / RAP-MLLM
Retrieval-Augmented Personalization
☆11 · Updated last month
Alternatives and similar repositories for RAP-MLLM:
Users who are interested in RAP-MLLM are comparing it to the repositories listed below.
- Official Repository of Personalized Visual Instruct Tuning ☆26 · Updated 2 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆18 · Updated 3 weeks ago
- This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" ☆19 · Updated 2 weeks ago
- [NeurIPS 2024] Official PyTorch implementation of "Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives" ☆32 · Updated last month
- [EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models" ☆14 · Updated 2 months ago
- [AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Visio… ☆17 · Updated this week
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models ☆18 · Updated 3 weeks ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆31 · Updated this week
- Official implementation of paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding ☆14 · Updated this week
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆35 · Updated last week
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆54 · Updated 7 months ago
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences ☆32 · Updated 3 months ago
- The official PyTorch implementation of "Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsifi… ☆13 · Updated last month
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)" ☆30 · Updated 6 months ago
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models ☆26 · Updated 2 months ago
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image and Video Generation (arXiv: 2410.12761) ☆19 · Updated 2 months ago
- [NeurIPS 2024] MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models ☆42 · Updated last month
- Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning ☆18 · Updated 4 months ago
- ☕️ CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆29 · Updated 6 months ago
- Code for paper: Unified Text-to-Image Generation and Retrieval ☆13 · Updated 6 months ago
- [NeurIPS 2024] What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights ☆22 · Updated 2 months ago