G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
☆99May 20, 2025Updated 9 months ago
Alternatives and similar repositories for G1
Users that are interested in G1 are comparing it to the libraries listed below
Sorting:
- ☆11Aug 7, 2025Updated 6 months ago
- ☆64Feb 4, 2026Updated last month
- ☆21May 3, 2025Updated 10 months ago
- [ICML 2024] Official Repository for the paper "Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models"☆10Jul 19, 2024Updated last year
- [NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs☆24Sep 26, 2024Updated last year
- This repository contains the code for the paper - "Aligning Text, Images, and 3D Structure Token-by-Token" (CVPR 2026)☆44Jun 11, 2025Updated 8 months ago
- MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision☆27May 26, 2025Updated 9 months ago
- ☆31Jul 16, 2025Updated 7 months ago
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups☆51Dec 23, 2024Updated last year
- OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents☆21Jan 6, 2026Updated last month
- ☆25Feb 27, 2023Updated 3 years ago
- Code for "SCL-RAI: Span-based Contrastive Learning with Retrieval Augmented Inference for Unlabeled Entity Problem in NER" @COLING-2022☆11Aug 20, 2022Updated 3 years ago
- Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions☆22Feb 11, 2026Updated 3 weeks ago
- A PyTorch implementation of the paper "MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis".☆12Jan 16, 2023Updated 3 years ago
- SFT+RL boosts multimodal reasoning☆46Jun 27, 2025Updated 8 months ago
- ☆31Sep 1, 2025Updated 6 months ago
- [ICLR 2025] Adaptive prompt tailored pruning of T2I diffusion models.☆15Feb 1, 2025Updated last year
- ☆13Jul 10, 2024Updated last year
- ☆11Mar 3, 2025Updated last year
- [ICCV 2025] Diffusion Curriculum (DisCL)☆18Sep 26, 2025Updated 5 months ago
- [NeurIPS 2025] A multimodal agent that can interact with its own PC in a multimodal manner.☆35Feb 25, 2026Updated last week
- ☆39Aug 4, 2025Updated 7 months ago
- CoV: Chain-of-View Prompting for Spatial Reasoning☆51Jan 23, 2026Updated last month
- ☆16May 13, 2025Updated 9 months ago
- ☆19Jun 4, 2025Updated 9 months ago
- UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation☆18Aug 12, 2025Updated 6 months ago
- The official implementation of our ICCV 2023 publication, C-VisDiT☆10Oct 23, 2024Updated last year
- The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism☆30Jul 17, 2024Updated last year
- ☆21Jul 25, 2025Updated 7 months ago
- AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation☆16Aug 3, 2025Updated 7 months ago
- Findings of EMNLP 2023: InfoCL: Alleviating Catastrophic Forgetting in Continual Text Classification from An Information Theoretic Perspe…☆14Aug 13, 2024Updated last year
- ☆23Jul 20, 2025Updated 7 months ago
- ☆13May 9, 2023Updated 2 years ago
- [ACL'25 (Findings)] Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents☆26Feb 17, 2026Updated 2 weeks ago
- Official implementation of "OpenCity3D: What do Vision-Language Models know about Urban Environments?" @ WACV2025☆16Nov 24, 2024Updated last year
- Your efficient and accurate answer verification system for RL training.☆41Jun 23, 2025Updated 8 months ago
- Official implementation of "Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation" (CVPR 202…☆40May 26, 2025Updated 9 months ago
- From Word to World: Can Large Language Models be Implicit Text-based World Models?☆48Dec 25, 2025Updated 2 months ago
- SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward☆92Aug 8, 2025Updated 6 months ago