Flowerfan / VistaLLaMA
☆14 Updated last year
Alternatives and similar repositories for VistaLLaMA
Users who are interested in VistaLLaMA are comparing it to the repositories listed below.
- Compress conventional Vision-Language Pre-training data ☆53 Updated 2 years ago
- Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning ☆24 Updated last year
- ☆16 Updated 9 months ago
- ☆32 Updated last year
- [ECCV2024] Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models ☆20 Updated last year
- Official Implementation (Pytorch) of the "VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Capti… ☆23 Updated 11 months ago
- (ICML 2024) Improve Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning ☆28 Updated last year
- (NeurIPS 2024 Spotlight) TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment ☆30 Updated last year
- [CVPR2024] The code of "UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory" ☆68 Updated last year
- Rui Qian, Xin Yin, Dejing Dou†: Reasoning to Attend: Try to Understand How <SEG> Token Works (CVPR 2025) ☆51 Updated 2 months ago
- FreeVA: Offline MLLM as Training-Free Video Assistant ☆68 Updated last year
- Turning to Video for Transcript Sorting ☆48 Updated 2 years ago
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences ☆40 Updated 10 months ago
- 【NeurIPS 2024】The official code of paper "Automated Multi-level Preference for MLLMs" ☆20 Updated last year
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆38 Updated 5 months ago
- Official pytorch implementation of "RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in Large Vision Language… ☆14 Updated last year
- This is the official implementation of ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos ☆40 Updated 2 months ago
- [ACL 2025] PruneVid: Visual Token Pruning for Efficient Video Large Language Models ☆63 Updated 7 months ago
- [CVPR 2024] Context-Guided Spatio-Temporal Video Grounding ☆64 Updated last year
- VideoHallucer, The first comprehensive benchmark for hallucination detection in large video-language models (LVLMs) ☆42 Updated 3 weeks ago
- [AAAI 2024] DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval. ☆47 Updated last year
- COLA: Evaluate how well your vision-language model can Compose Objects Localized with Attributes! ☆25 Updated last year
- [ECCV 2022] GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval ☆16 Updated 3 years ago
- Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision ☆44 Updated 2 months ago
- Official implementation of MIA-DPO ☆70 Updated 11 months ago
- ✨A curated list of papers on the uncertainty in multi-modal large language model (MLLM). ☆57 Updated 9 months ago
- [CVPR 2025] Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering ☆53 Updated 5 months ago
- Official Pytorch implementation of 'Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning'? (ICLR2024) ☆13 Updated last year
- Codes for ICLR 2025 Paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM ☆76 Updated 8 months ago
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs ☆53 Updated 10 months ago