gpt4video / GPT4Video

Official Code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

☆144

Alternatives and similar repositories for GPT4Video

Users interested in GPT4Video are comparing it to the repositories listed below.

Deaddawn / DreamFrame-code
☆184Updated 4 months ago
sterzhang / image-textualization
Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)
☆169Updated last year
md-mohaiminul / VideoRecap
☆201Updated last year
aim-uofa / AutoStory
[IJCV'24] AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort
☆151Updated last year
SHI-Labs / VCoder
[CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models
☆280Updated last year
WangWenhao0716 / VidProM
[NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
☆167Updated last year
FudanNLPLAB / MouSi
☆75Updated last year
icoz69 / StableLLAVA
Official repo for StableLLAVA
☆95Updated last year
invictus717 / InteractiveVideo
InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
☆131Updated last year
will-singularity / Skywork-MM
Empirical Study Towards Building An Effective Multi-Modal Large Language Model
☆22Updated 2 years ago
OpenGVLab / MM-Interleaved
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
☆247Updated last year
hlchen23 / ADPN-MM
Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi…
☆52Updated last year
bytedance / Shot2Story
A new multi-shot video understanding benchmark Shot2Story with comprehensive video summaries and detailed shot-level captions.
☆159Updated 10 months ago
cnzzx / VSA
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
☆128Updated last year
DLYuanGod / ArtGPT-4
Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4
☆28Updated 2 years ago
vaew / SkyScript-100M
SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama: https://arxiv.org/abs/2408.09333v2
☆129Updated last year
mulanai / MuLan
MuLan: Adapting Multilingual Diffusion Models for 110+ Languages (adds multilingual support to any diffusion model without additional training)
☆143Updated 10 months ago
SparksJoe / Prism
A Framework for Decoupling and Assessing the Capabilities of VLMs
☆43Updated last year
mutonix / Vript
☆156Updated 10 months ago
farewellthree / PPLLaVA
Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance"
☆130Updated last year
mengcye / LAION-SG
☆56Updated 7 months ago
OpenGVLab / ControlLLM
ControlLLM: Augment Language Models with Tools by Searching on Graphs
☆194Updated last year
hyc2026 / StoryTeller
☆80Updated 8 months ago
MBZUAI-LLM / web2code
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
☆97Updated last year
bytedance / Valley
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
☆263Updated last week
artemisp / LAVIS-XInstructBLIP
LAVIS - A One-stop Library for Language-Vision Intelligence
☆48Updated last year
ggg0919 / cantor
☆90Updated last year
mshukor / UnIVAL
[TMLR23] Official implementation of UnIVAL: Unified Model for Image, Video, Audio and Language Tasks.
☆232Updated last year
modelscope / lite-sora
An initiative to replicate Sora
☆104Updated last year
thunlp / Migician
[ACL2025 Findings] Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
☆81Updated 6 months ago