yunlong10 / AVicuna
[AAAI 2025] Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
☆19Updated last week
Alternatives and similar repositories for AVicuna:
Users who are interested in AVicuna are comparing it to the repositories listed below
- ☆16Updated 3 months ago
- [CVPR 2025] VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?☆21Updated 2 weeks ago
- Official repository of NeurIPS D&B Track 2024 paper "VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understan…☆33Updated 2 months ago
- ☆27Updated 5 months ago
- [ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario…☆52Updated 6 months ago
- Official Implementation of VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention☆33Updated last week
- [Arxiv 2024] Official code for MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions☆31Updated last month
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. (CVPR 2025)☆18Updated last week
- ☆32Updated last year
- PyTorch implementation of InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following☆30Updated 2 months ago
- A post-training method to enhance CLIP's fine-grained visual representations with generative models.☆21Updated this week
- FQGAN: Factorized Visual Tokenization and Generation☆46Updated this week
- Exposing Text-Image Inconsistency Using Diffusion Models (ICLR 2024)☆10Updated 9 months ago
- [NeurIPS 2024] COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing☆23Updated 3 months ago
- ☆15Updated 5 months ago
- ☆19Updated 7 months ago
- Unified Audio-Visual Perception for Multi-Task Video Localization☆24Updated 11 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆111Updated 3 months ago
- Official Implementation of VideoDPO☆76Updated 2 months ago
- Frequency Autoregressive Image Generation with Continuous Tokens☆42Updated 3 weeks ago
- The official implementation of A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation☆17Updated 4 months ago
- Code for: "Long-Context Autoregressive Video Modeling with Next-Frame Prediction"☆112Updated this week
- R1-like Video-LLM for Temporal Grounding☆62Updated last week
- Accepted by CVPR 2024☆33Updated 10 months ago
- Official repo for "VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation" [EMNLP 2024]☆85Updated last month
- [CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners☆140Updated 8 months ago
- Official implementation of HawkEye: Training Video-Text LLMs for Grounding Text in Videos☆40Updated 11 months ago
- Official PyTorch implementation for LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior (ICLR 2025 Oral).☆58Updated last month
- Code release for "PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop" (arXiv 2025)☆24Updated last week
- 🌀 R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024)☆80Updated 9 months ago