Yu-xm / UnicornLinks
Text-Only Data Synthesis for Vision Language Model Training
☆22Updated 6 months ago
Alternatives and similar repositories for Unicorn
Users that are interested in Unicorn are comparing it to the libraries listed below
Sorting:
- ☆39Updated 6 months ago
- ☆135Updated last month
- https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT☆106Updated last month
- [NeurIPS25] Official Implementation (Pytorch) of "DeepVideo-R1"☆30Updated 3 weeks ago
- Official Implementation of OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation☆36Updated 5 months ago
- ☆64Updated 5 months ago
- This is an early exploration to introduce Interleaving Reasoning to Text-to-image Generation field and achieve the SoTA benchmark perform…☆79Updated 2 months ago
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆80Updated 5 months ago
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model☆17Updated 10 months ago
- A unified framework for controllable caption generation across images, videos, and audio. Supports multi-modal inputs and customizable ca…☆52Updated 4 months ago
- ICML2025☆61Updated 3 months ago
- ☆15Updated 6 months ago
- Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give…☆191Updated 2 months ago
- Official implement of MIA-DPO☆67Updated 10 months ago
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆94Updated 9 months ago
- Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"☆120Updated last month
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning☆52Updated 4 months ago
- [ICCV 2025 Oral] Official implementation of Learning Streaming Video Representation via Multitask Training.☆71Updated 3 weeks ago
- ☆95Updated 5 months ago
- ☆56Updated 7 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆62Updated 4 months ago
- ☆79Updated 5 months ago
- [NeurIPS 2024] The official implement of research paper "FreeLong : Training-Free Long Video Generation with SpectralBlend Temporal Atten…☆64Updated 5 months ago
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration☆25Updated last year
- Code for Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? [COLM 2024]☆24Updated last year
- A Comprehensive Dataset for Advanced Image Generation and Editing}☆30Updated 2 months ago
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision☆177Updated 2 weeks ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆73Updated last year
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models☆37Updated 5 months ago
- (ICCV2025) Official repository of paper "ViSpeak: Visual Instruction Feedback in Streaming Videos"☆41Updated 5 months ago