dhcode-cpp / grpo-loss
☆39Updated 9 months ago
Alternatives and similar repositories for grpo-loss
Users who are interested in grpo-loss are comparing it to the repositories listed below.
- Scaling Preference Data Curation via Human-AI Synergy☆132Updated 5 months ago
- ☆86Updated 4 months ago
- RLHF experiments on a single A100 40G GPU. Support PPO, GRPO, REINFORCE, RAFT, RLOO, ReMax, DeepSeek R1-Zero reproducing.☆76Updated 10 months ago
- Fantastic Data Engineering for Large Language Models☆93Updated 11 months ago
- ☆123Updated last year
- ☆162Updated 11 months ago
- A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details.☆221Updated 4 months ago
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning☆190Updated 9 months ago
- ☆50Updated last year
- ☆77Updated 10 months ago
- a toolkit on knowledge distillation for large language models☆221Updated last week
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning☆154Updated 11 months ago
- a-m-team's exploration in large language modeling☆195Updated 6 months ago
- This is a personal reimplementation of Google's Infini-transformer, utilizing a small 2b model. The project includes both model and train…☆58Updated last year
- ☆132Updated 7 months ago
- [ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning☆184Updated 5 months ago
- Pretrain, decay, and SFT a CodeLLM from scratch 🧙‍♂️☆39Updated last year
- ☆147Updated last year
- This is a repo for showcasing using MCTS with LLMs to solve gsm8k problems☆93Updated last month
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning. COLM 2024 Accepted Paper☆33Updated last year
- PSFT is a trust-region–inspired fine-tuning objective that views SFT as a policy gradient method with constant advantages, constraining p…☆34Updated 3 months ago
- Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework☆154Updated this week
- ☆65Updated last year
- [ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)☆181Updated 10 months ago
- ☆174Updated 7 months ago
- How to train an LLM tokenizer☆154Updated 2 years ago
- Pre-trained, Scalable, High-performance Reward Models via Policy Discriminative Learning.☆163Updated 2 months ago
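- This project covers dataset synthesis, model training, and evaluation for the mathematical problem-solving ability of large models, along with notes on related articles.☆98Updated last year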
- MiroRL is an MCP-first reinforcement learning framework for deep research agent.☆183Updated 3 months ago
- The related works and background techniques about OpenAI o1☆221Updated 11 months ago