Yuan-ManX / ai-multimodal-timeline
Here we will track the latest AI Multimodal Models, including Multimodal Foundation Models, LLM, Agent, Audio, Image, Video, Music and 3D content. 🔥
★35 · Updated 8 months ago
Alternatives and similar repositories for ai-multimodal-timeline
Users that are interested in ai-multimodal-timeline are comparing it to the libraries listed below
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ★104 · Updated 2 months ago
- Controllable Animation Video Generation with Large Models-based Multimodal Agents ★205 · Updated 3 weeks ago
- Live2Diff: A Pipeline that processes Live video streams by a uni-directional video Diffusion model. ★195 · Updated last year
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions ★129 · Updated last year
- A one-stop library to standardize the inference and evaluation of all the conditional video generation models. ★50 · Updated 8 months ago
- ★179 · Updated 5 months ago
- Visual RAG using less than 300 lines of code. ★29 · Updated last year
- Implementation for the paper "ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems". ★193 · Updated 7 months ago
- An open source community implementation of the model from the paper: "Movie Gen: A Cast of Media Foundation Models". Join our community … ★58 · Updated last week
- ★55 · Updated 11 months ago
- ★69 · Updated last year
- Official PyTorch implementation of TokenSet. ★126 · Updated 7 months ago
- A streamlined implementation of Grounding DINO and SAM for advanced image segmentation. This lightweight solution simplifies the integrat… ★64 · Updated last year
- Code release for the paper, "Proactive Agents for Text-to-Image Generation under Uncertainty" ★56 · Updated 3 months ago
- Enhancement in Multimodal Representation Learning. ★40 · Updated last year
- Video-Infinity generates long videos quickly using multiple GPUs without extra training. ★185 · Updated last year
- ★35 · Updated 2 years ago
- [CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models ★277 · Updated last year
- ★206 · Updated last year
- INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model ★42 · Updated last year
- ★35 · Updated 9 months ago
- ★194 · Updated last year
- ★18 · Updated 6 months ago
- ★13 · Updated last year
- Inference-time scaling of diffusion-based image and video generation models. ★169 · Updated 4 months ago
- ★86 · Updated last year
- Code of "Style Customization of Text-to-Vector Generation with Image Diffusion Priors" ★88 · Updated 5 months ago
- [NeurIPS 2025] Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation, arXiv 2024 ★64 · Updated last week
- Krea Realtime 14B. An open-source realtime AI video model. ★175 · Updated last week
- [ICCV2025] WikiAutoGen official page ★20 · Updated 4 months ago