SWE-Perf / SWE-Perf
☆34 · Updated 2 weeks ago
Alternatives and similar repositories for SWE-Perf
Users interested in SWE-Perf are comparing it to the repositories listed below.
- Reproducing R1 for Code with Reliable Rewards (☆251, updated 3 months ago)
- 🚀 SWE-bench Goes Live! (☆112, updated last month)
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | ACL 2024 SRW Oral (☆62, updated 10 months ago)
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents (☆148, updated last month)
- A Comprehensive Benchmark for Software Development (☆113, updated last year)
- LeetCode Training and Evaluation Dataset (☆30, updated 4 months ago)
- Repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" (☆81, updated last year)
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling (☆167, updated last month)
- Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows" (☆102, updated last week)
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning (☆108, updated 3 months ago)
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning" (☆170, updated 3 months ago)
- [COLM'25] Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill? (☆34, updated 2 months ago)
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* (☆112, updated 8 months ago)
- NaturalCodeBench (Findings of ACL 2024) (☆67, updated 10 months ago)
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving (☆238, updated last week)
- Must-read papers on Repository-level Code Generation & Issue Resolution 🔥 (☆148, updated last week)
- Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge (☆73, updated last month)
- General Reasoner: Advancing LLM Reasoning Across All Domains (☆163, updated 2 months ago)
- Collections of RLxLM experiments using minimal code (☆13, updated 6 months ago)
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation (☆152, updated 10 months ago)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? (☆153, updated 9 months ago)
- SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner (☆26, updated last month)
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) (☆155, updated last week)
- A version of verl to support tool use (☆341, updated this week)
- Official repository for the paper "FullStack Bench: Evaluating LLMs as Full Stack Coders" (☆100, updated 3 months ago)