open-thought / tiny-grpo
Minimal hackable GRPO implementation
☆188 Updated last month
Alternatives and similar repositories for tiny-grpo:
Users who are interested in tiny-grpo are comparing it to the libraries listed below.
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO) ☆91 Updated this week
- ☆559 Updated 2 weeks ago
- RLHF implementation details of OAI's 2019 codebase ☆184 Updated last year
- ☆136 Updated 4 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆310 Updated 3 months ago
- A repo showcasing the use of MCTS with LLMs to solve GSM8K problems ☆67 Updated last week
- R1-searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning ☆376 Updated last week
- A visualization tool for deeper understanding and easier debugging of RLHF training. ☆177 Updated last month
- Official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling". ☆231 Updated last month
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆283 Updated last week
- (ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training ☆260 Updated 10 months ago
- ☆262 Updated last week
- Large Reasoning Models ☆800 Updated 3 months ago
- ☆507 Updated 2 months ago
- Source code for Self-Evaluation Guided MCTS for online DPO. ☆299 Updated 7 months ago
- A series of technical reports on Slow Thinking with LLM ☆595 Updated last week
- A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details. ☆166 Updated last week
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024) ☆597 Updated 2 months ago
- Exploring Applications of GRPO ☆109 Updated last month
- [NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward ☆851 Updated last month
- ☆325 Updated last month
- RLHF experiments on a single A100 40G GPU. Supports PPO, GRPO, REINFORCE, RAFT, RLOO, ReMax, and DeepSeek R1-Zero reproduction. ☆48 Updated last month
- ☆116 Updated 9 months ago
- ☆485 Updated last week
- TransMLA: Multi-Head Latent Attention Is All You Need ☆221 Updated 3 weeks ago
- Notes and commented code for RLHF (PPO) ☆79 Updated last year
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" ☆129 Updated 8 months ago
- ☆166 Updated last month
- Micro Llama is a small Llama-based model with 300M parameters, trained from scratch on a $500 budget ☆145 Updated last year
- AN O1 REPLICATION FOR CODING ☆329 Updated 3 months ago