Daria8976 / MMAD
We propose MMAD, a novel automated pipeline for precise AD (audio description) generation. MMAD introduces ambient music alongside the visual and linguistic modalities, enhancing the model's multimodal representation learning through modality encoders and alignment.
★15 · Updated 8 months ago
Alternatives and similar repositories for MMAD
Users who are interested in MMAD are comparing it to the repositories listed below.
- This is the official implementation of the CVPR 2024 paper "EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models". ★87 · Updated 8 months ago
- R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024) ★88 · Updated last year
- Narrative movie understanding benchmark ★77 · Updated 3 months ago
- [CVPR 2024] Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models ★255 · Updated 9 months ago
- [CVPR 2024] Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection ★106 · Updated last year
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions ★234 · Updated last year
- ★43 · Updated 10 months ago
- [CVPR 2025] Number it: Temporal Grounding Videos like Flipping Manga ★121 · Updated 3 weeks ago
- Official repository of the NeurIPS 2024 D&B Track paper "VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understan…" ★36 · Updated 8 months ago
- What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness ★23 · Updated 4 months ago
- Official implementation of the paper "AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding" ★82 · Updated 5 months ago
- [CVPR 2025] Online Video Understanding: OVBench and VideoChat-Online ★66 · Updated last month
- ★192 · Updated last year
- This is the official implementation of the ICCV 2025 paper "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams". ★235 · Updated 2 months ago
- ★26 · Updated 5 months ago
- [ICCV 2025] Official repository of the paper "ViSpeak: Visual Instruction Feedback in Streaming Videos" ★40 · Updated 2 months ago
- [ICLR 2025] TRACE: Temporal Grounding Video LLM via Causal Event Modeling ★124 · Updated last month
- Use 2 lines to empower absolute time awareness for Qwen2.5VL's MRoPE ★23 · Updated last week
- Official code of SmartEdit [CVPR 2024 Highlight] ★357 · Updated last year
- [WWW 2025] Official PyTorch code for "CTR-Driven Advertising Image Generation with Multimodal Large Language Models" ★56 · Updated last month
- A large-scale dataset for training and evaluating a model's ability on dense text image generation ★79 · Updated last month
- [CVPR 2024 Highlight] Official PyTorch implementation of the paper "VTimeLLM: Empower LLM to Grasp Video Moments". ★287 · Updated last year
- An early exploration to introduce interleaving reasoning to the text-to-image generation field and achieve the SoTA benchmark perform… ★58 · Updated last week
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation ★217 · Updated last month
- ★156 · Updated 8 months ago
- [EMNLP 2025 Findings] Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models ★125 · Updated last month
- [CVPR 2024] MotionEditor is the first diffusion-based model capable of video motion editing. ★178 · Updated 2 weeks ago
- Official implementation of the paper "ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding" ★36 · Updated 6 months ago
- Structured Video Comprehension of Real-World Shorts ★201 · Updated this week
- ★78 · Updated 6 months ago