JaaackHongggg / WorldSense
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
☆19Updated last month
Alternatives and similar repositories for WorldSense:
Users who are interested in WorldSense are comparing it to the repositories listed below
- This is the official repo for ByteVideoLLM/Dynamic-VLM☆20Updated 3 months ago
- FreeVA: Offline MLLM as Training-Free Video Assistant☆57Updated 9 months ago
- This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"☆23Updated 3 months ago
- ☆32Updated 8 months ago
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision☆38Updated last week
- ☆29Updated 8 months ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration☆23Updated 5 months ago
- [CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding☆42Updated last week
- [ECCV 2024] Learning Video Context as Interleaved Multimodal Sequences☆36Updated 3 weeks ago
- ☆37Updated 3 months ago
- ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback☆63Updated 6 months ago
- Official implementation of MIA-DPO☆54Updated 2 months ago
- 「AAAI 2024」 Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation☆77Updated 9 months ago
- ☆23Updated this week
- The official repository for the paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models"☆35Updated last month
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model☆16Updated last month
- [CVPR 2025] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding☆38Updated last month
- Official PyTorch code of GroundVQA (CVPR'24)☆59Updated 6 months ago
- [ICLR 2025] CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion☆41Updated 2 months ago
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning☆16Updated last week
- VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs)☆27Updated this week
- 👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)☆57Updated 2 months ago
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning☆20Updated 2 weeks ago
- [CVPR'25] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection☆65Updated this week
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?☆40Updated last week
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs☆46Updated 3 weeks ago
- Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges☆68Updated last month
- ☆28Updated 4 months ago
- 🌈 Unifying Visual Understanding and Generation with Dual Visual Vocabularies☆26Updated last week
- LLMBind: A Unified Modality-Task Integration Framework☆18Updated 9 months ago