hanghuacs / MMComposition
☆16Updated 3 months ago
Alternatives and similar repositories for MMComposition:
Users who are interested in MMComposition are comparing it to the libraries listed below:
- [CVPR 2025] VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?☆21Updated 2 weeks ago
- [AAAI 2025] Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding☆19Updated last week
- Official repository of NeurIPS D&B Track 2024 paper "VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understan…☆33Updated 2 months ago
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. (CVPR 2025)☆18Updated last week
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆111Updated 3 months ago
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation☆54Updated 2 weeks ago
- official repo for "VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation" [EMNLP2024]☆85Updated last month
- [NAACL 2024] LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-text Generation?☆37Updated 9 months ago
- PyTorch implementation of InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following☆30Updated 2 months ago
- FQGAN: Factorized Visual Tokenization and Generation☆46Updated this week
- [AAAI 2025] Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos☆23Updated 6 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆53Updated 5 months ago
- ☆29Updated 2 weeks ago
- Official PyTorch code of "Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation".☆22Updated last month
- ☆23Updated 6 months ago
- ☆27Updated 5 months ago
- [CVPR2025] Number it: Temporal Grounding Videos like Flipping Manga☆67Updated this week
- [CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos☆52Updated this week
- [ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario…☆52Updated 6 months ago
- Official code for CVPR 2024 paper: Discriminative Probing and Tuning for Text-to-Image Generation☆30Updated this week
- (ICLR 2025 Spotlight) Official code repository for Interleaved Scene Graph.☆18Updated last month
- The official implementation of A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation☆17Updated 4 months ago
- The official repository for paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models".☆35Updated last month
- ☆86Updated 3 months ago
- [CVPR 2025] OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?☆40Updated last week
- Official Implementation of VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention☆33Updated last week
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark☆86Updated 2 months ago
- ☆29Updated 2 months ago
- Empowering Unified MLLM with Multi-granular Visual Generation☆119Updated 2 months ago
- Exposing Text-Image Inconsistency Using Diffusion Models (ICLR 2024)☆10Updated 9 months ago