yunlong10 / VidComposition
See How Top MLLMs Understand Video Compositions.
⭐ 14 · Updated this week
Related projects
Alternatives and complementary repositories for VidComposition
- ⭐ 10 · Updated 3 weeks ago
- ⭐ 18 · Updated last month
- [CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners ⭐ 128 · Updated 4 months ago
- Papers and codes collection for customized, personalized and editable generative models ⭐ 23 · Updated last month
- Official implementation for BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way ⭐ 18 · Updated last month
- [CVPR 2024] Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection ⭐ 75 · Updated 4 months ago
- Accepted by CVPR 2024 ⭐ 28 · Updated 6 months ago
- [ECCV'24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenario… ⭐ 41 · Updated 2 months ago
- [NeurIPS 2024] Visual Perception by Large Language Model's Weights ⭐ 29 · Updated last month
- Official implementation of "ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video" (ECCV2024) ⭐ 18 · Updated 3 months ago
- Research code for NeurIPS 2023 paper "Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser" ⭐ 15 · Updated last year
- [ECCV 2024 Oral] Audio-Synchronized Visual Animation ⭐ 37 · Updated 2 months ago
- [ICCV 2023] DiffusionRet: Generative Text-Video Retrieval with Diffusion Model ⭐ 120 · Updated 7 months ago
- A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems ⭐ 197 · Updated 2 months ago
- ⭐ 31 · Updated 8 months ago
- ⭐ 21 · Updated last month
- You can easily calculate FVD, PSNR, SSIM, LPIPS for evaluating the quality of generated or predicted videos. ⭐ 240 · Updated 5 months ago
- ⭐ 14 · Updated 5 months ago
- [NAACL 2024] LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-text Generation? ⭐ 37 · Updated 5 months ago
- Official code for WACV 2024 paper, "Annotation-free Audio-Visual Segmentation" ⭐ 26 · Updated last month
- Vision Transformers are Parameter-Efficient Audio-Visual Learners ⭐ 89 · Updated last year
- ⭐ 13 · Updated last month
- Unified Audio-Visual Perception for Multi-Task Video Localization ⭐ 22 · Updated 7 months ago
- NeurIPS'2023 official implementation code ⭐ 59 · Updated last year
- Official Implementation of "Open-Vocabulary Audio-Visual Semantic Segmentation" [ACM MM 2024 Oral]. ⭐ 14 · Updated 2 weeks ago
- [Arxiv 2024] Official code for MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions ⭐ 24 · Updated this week
- ⭐ 28 · Updated last month
- [CVPR'23 Highlight] AutoAD: Movie Description in Context. ⭐ 88 · Updated 2 weeks ago