Yu-xm / UnicornLinks
Text-Only Data Synthesis for Vision Language Model Training
☆23Updated 7 months ago
Alternatives and similar repositories for Unicorn
Users that are interested in Unicorn are comparing it to the libraries listed below
Sorting:
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆86Updated 6 months ago
- ☆39Updated 8 months ago
- ☆21Updated last month
- This is an early exploration to introduce Interleaving Reasoning to Text-to-image Generation field and achieve the SoTA benchmark perform…☆84Updated 4 months ago
- [NeurIPS25] Official Implementation (Pytorch) of "DeepVideo-R1"☆31Updated 2 months ago
- ☆141Updated 3 months ago
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning☆52Updated 6 months ago
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model☆17Updated 11 months ago
- E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models☆24Updated 3 weeks ago
- ☆56Updated 9 months ago
- ☆63Updated 6 months ago
- ICML2025☆63Updated 4 months ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration☆26Updated last year
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆64Updated 6 months ago
- [NeurIPS 2024] The official implement of research paper "FreeLong : Training-Free Long Video Generation with SpectralBlend Temporal Atten…☆63Updated 6 months ago
- [NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.☆51Updated last year
- FQGAN: Factorized Visual Tokenization and Generation☆57Updated 9 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆76Updated last year
- The code repository of UniRL☆50Updated 7 months ago
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark☆138Updated 7 months ago
- Official implementation of Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (NeurIPS 2025)☆44Updated 2 months ago
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision☆196Updated last month
- This repository provides an improved LLamaGen Model, fine-tuned on 500,000 high-quality images, each accompanied by over 300 token prompt…☆30Updated last year
- Quick Long Video Understanding [TMLR2025]☆74Updated 2 months ago
- https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT☆114Updated 2 months ago
- Code for Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [COLM 2024]☆24Updated last year
- On Path to Multimodal Generalist: General-Level and General-Bench☆19Updated 6 months ago
- Official Implementation of OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation☆37Updated 6 months ago
- TEMPURA enables video-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of u…☆24Updated 7 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆39Updated 7 months ago