open-thought / tiny-grpo
Minimal hackable GRPO implementation
⭐247 · Updated 4 months ago
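For readers new to the algorithm, the defining step in GRPO is a group-relative advantage: several completions are sampled per prompt, and each completion's reward is normalized against the statistics of its own group, replacing the learned value baseline used in PPO. Below is a minimal sketch of that step plus a clipped surrogate loss, assuming PyTorch; the function names, shapes, and toy numbers are illustrative and are not tiny-grpo's actual API.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled completions.

    GRPO's baseline is the group itself: each completion's advantage is its
    reward standardized against the other completions for the same prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_surrogate_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate applied per completion.

    logp_new / logp_old: summed log-probabilities of each completion under the
    current and sampling policies, shape (num_prompts, group_size).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage: 2 prompts, 4 sampled completions each, binary rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = group_relative_advantages(rewards)
loss = grpo_surrogate_loss(torch.zeros_like(adv), torch.zeros_like(adv), adv)
```

Most implementations also add a KL penalty against a frozen reference model; it is omitted here to keep the sketch short.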
Alternatives and similar repositories for tiny-grpo
Users interested in tiny-grpo are comparing it to the libraries listed below.
- ⭐782 · Updated last month
- 🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc. ⭐383 · Updated 2 weeks ago
- Tina: Tiny Reasoning Models via LoRA ⭐260 · Updated 3 weeks ago
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024) ⭐639 · Updated 5 months ago
- Official repo for the paper "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" ⭐238 · Updated last month
- SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning ⭐422 · Updated this week
- ⭐220 · Updated last month
- Large Reasoning Models ⭐804 · Updated 6 months ago
- ⭐300 · Updated 3 weeks ago
- ⭐203 · Updated 4 months ago
- Understanding R1-Zero-Like Training: A Critical Perspective ⭐991 · Updated last month
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO) ⭐105 · Updated last month
- ⭐331 · Updated 2 weeks ago
- A tiny reproduction of DeepSeek-R1-Zero on two A100s. ⭐67 · Updated 4 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ⭐339 · Updated 6 months ago
- Official Implementation for the paper "d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning" ⭐213 · Updated last week
- (ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training ⭐276 · Updated last year
- A version of verl to support tool use ⭐251 · Updated last week
- Source code for Self-Evaluation Guided MCTS for online DPO ⭐318 · Updated 10 months ago
- minimal GRPO implementation from scratch ⭐90 · Updated 3 months ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior. ⭐241 · Updated 2 months ago
- Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" ⭐179 · Updated 2 months ago
- Code for the paper "Learning to Reason without External Rewards" ⭐295 · Updated last week
- ⭐540 · Updated 5 months ago
- RLHF implementation details of OAI's 2019 codebase ⭐187 · Updated last year
- TTRL: Test-Time Reinforcement Learning ⭐650 · Updated 2 weeks ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ⭐222 · Updated last month
- A series of technical reports on Slow Thinking with LLMs ⭐699 · Updated 2 weeks ago
- Notes and commented code for RLHF (PPO) ⭐96 · Updated last year
- Official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling" ⭐264 · Updated 4 months ago