Yu-xm / Unicorn

Text-Only Data Synthesis for Vision Language Model Training

☆23

Alternatives and similar repositories for Unicorn

Users who are interested in Unicorn are comparing it to the repositories listed below.

TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆86 Updated 6 months ago
SJTU-DENG-Lab / UniCMs
☆39 Updated 8 months ago
Vchitect / Uni-MMMU
☆21 Updated last month
Osilly / Interleaving-Reasoning-Generation
This is an early exploration to introduce Interleaving Reasoning to Text-to-image Generation field and achieve the SoTA benchmark perform…
☆84 Updated 4 months ago
mlvlab / DeepVideoR1
[NeurIPS 2025] Official Implementation (PyTorch) of "DeepVideo-R1"
☆31 Updated 2 months ago
TencentARC / MindOmni
☆141 Updated 3 months ago
EvolvingLMMs-Lab / MGPO
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
☆52 Updated 6 months ago
yliu-cs / PiTe
[ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model
☆17 Updated 11 months ago
shengjun-zhang / VisualGRPO
E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
☆24 Updated 3 weeks ago
mengcye / LAION-SG
☆56 Updated 9 months ago
Tiezheng11 / Vision-Language-Vision
☆63 Updated 6 months ago
Tencent / HaploVLM
[ICML 2025]
☆63 Updated 4 months ago
markywg / transagent
[NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
☆26 Updated last year
OpenGVLab / TPO
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
☆64 Updated 6 months ago
aniki-ly / FreeLong
[NeurIPS 2024] The official implementation of the research paper "FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Atten…
☆63 Updated 6 months ago
showlab / EvolveDirector
[NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.
☆51 Updated last year
showlab / FQGAN
FQGAN: Factorized Visual Tokenization and Generation
☆57 Updated 9 months ago
SilentView / LVD-2M
[NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"
☆76 Updated last year
showlab / UniRL
The code repository of UniRL
☆50 Updated 7 months ago
rese1f / aurora
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
☆138 Updated 7 months ago
HL-hanlin / Bifrost-1
Official implementation of Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (NeurIPS 2025)
☆44 Updated 2 months ago
Fr0zenCrane / UniCoT
Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
☆196 Updated last month
Owen718 / LongPrompt-LLamaGen
This repository provides an improved LLamaGen Model, fine-tuned on 500,000 high-quality images, each accompanied by over 300 token prompt…
☆30 Updated last year
TIGER-AI-Lab / QuickVideo
Quick Long Video Understanding [TMLR2025]
☆74 Updated 2 months ago
multimodal-reasoning-lab / Bagel-Zebra-CoT
https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
☆114 Updated 2 months ago
zeyofu / Commonsense-T2I
Code for Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [COLM 2024]
☆24 Updated last year
path2generalist / General-Level
On Path to Multimodal Generalist: General-Level and General-Bench
☆19 Updated 6 months ago
LanceZPF / OpenING
Official Implementation of OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
☆37 Updated 6 months ago
Andy-Cheng / TEMPURA
TEMPURA enables video-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of u…
☆24 Updated 7 months ago
HaroldChen19 / VistaDPO
[ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
☆39 Updated 7 months ago