stepfun-ai / Step3-VL-10B
Step3-VL-10B: A compact yet frontier multimodal model achieving SOTA performance at the 10B scale, matching open-source models 10-20x its size.
☆390 · Updated 3 weeks ago
Alternatives and similar repositories for Step3-VL-10B
Users interested in Step3-VL-10B are comparing it to the libraries listed below.
- [ICLR'26] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs ☆97 · Updated 2 weeks ago
- (ICLR 2026) An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning" ☆186 · Updated this week
- DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models ☆169 · Updated last month
- OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language. ☆634 · Updated 3 months ago
- ☆517 · Updated 2 weeks ago
- [ArXiv 2025] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models ☆128 · Updated last month
- NextFlow🚀: Unified Sequential Modeling Activates Multimodal Understanding and Generation ☆309 · Updated last month
- [MTI-LLM@NeurIPS 2025] Official implementation of "PyVision: Agentic Vision with Dynamic Tooling." ☆147 · Updated 6 months ago
- [🚀 ICLR 2026 Oral] NextStep-1: SOTA Autoregressive Image Generation with Continuous Tokens. A research project developed by the StepFun's M… ☆602 · Updated last month
- ACL 2025: Synthetic data generation pipelines for text-rich images. ☆155 · Updated 11 months ago
- The official repository of "R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Integration" ☆136 · Updated 5 months ago
- Fully Open Framework for Democratized Multimodal Training ☆718 · Updated last month
- A Scientific Multimodal Foundation Model ☆706 · Updated last week
- [ACL 2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆121 · Updated 6 months ago
- InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models ☆84 · Updated last week
- PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning ☆313 · Updated last week
- ☆63 · Updated 7 months ago
- Official Implementation of LaViDa: A Large Diffusion Language Model for Multimodal Understanding ☆194 · Updated last month
- ☆37 · Updated 2 months ago
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning ☆157 · Updated 8 months ago
- Official code for "Monet: Reasoning in Latent Visual Space Beyond Image and Language" ☆125 · Updated last week
- Official Code for "ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning" ☆79 · Updated 2 months ago
- Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence" ☆129 · Updated last month
- Official inference code and LongText-Bench benchmark for our paper X-Omni (https://arxiv.org/pdf/2507.22058). ☆420 · Updated 5 months ago
- ☆107 · Updated 8 months ago
- StreamingVLM: Real-Time Understanding for Infinite Video Streams ☆872 · Updated 3 months ago
- A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerfu… ☆449 · Updated 2 months ago
- Official implementation of Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning ☆223 · Updated this week
- 🚀ReVisual-R1 is a 7B open-source multimodal language model that follows a three-stage curriculum—cold-start pre-training, multimodal rei… ☆196 · Updated 2 months ago
- Official implementation of GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization ☆374 · Updated last month