xyliu-cs / RISE
Official Implementation of RISE (Reinforcing Reasoning with Self-Verification)
☆28 · Updated last week
Alternatives and similar repositories for RISE
Users interested in RISE are comparing it to the libraries listed below.
- ☆27 · Updated 3 weeks ago
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ☆67 · Updated 10 months ago
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆61 · Updated last month
- Training and Benchmarking LLMs for Code Preference. ☆33 · Updated 8 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆148 · Updated 9 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral at ACL 2024 SRW ☆61 · Updated 9 months ago
- A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages. ☆55 · Updated 8 months ago
- ☆35 · Updated 2 years ago
- Code for the TMLR 2023 paper "PPOCoder: Execution-based Code Generation using Deep Reinforcement Learning" ☆113 · Updated last year
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ☆84 · Updated 9 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆102 · Updated 2 months ago
- ☆31 · Updated 3 weeks ago
- ☆28 · Updated 9 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆127 · Updated last year
- XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts ☆33 · Updated last year
- Interpretable Contrastive Monte Carlo Tree Search Reasoning ☆49 · Updated 8 months ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆102 · Updated 4 months ago
- CodeUltraFeedback: aligning large language models to coding preferences ☆71 · Updated last year
- ☆48 · Updated last year
- RepoQA: Evaluating Long-Context Code Understanding ☆109 · Updated 8 months ago
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models". ☆78 · Updated last year
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆150 · Updated last year
- Benchmarking LLMs' Emotional Alignment with Humans ☆104 · Updated 5 months ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆58 · Updated last year
- RL Scaling and Test-Time Scaling (ICML'25) ☆108 · Updated 5 months ago
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆80 · Updated 2 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆112 · Updated last week
- ☆38 · Updated 2 months ago
- [NeurIPS 2024] Evaluation harness for SWT-Bench, a benchmark for evaluating LLM repository-level test generation ☆51 · Updated last month
- Evaluate the Quality of Critique ☆36 · Updated last year