SWE-Perf / SWE-Perf
☆37 · Updated last month
Alternatives and similar repositories for SWE-Perf
Users interested in SWE-Perf are comparing it to the repositories listed below.
- SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution · ☆88 · Updated 2 weeks ago
- Must-read papers on Repository-level Code Generation & Issue Resolution 🔥 · ☆171 · Updated this week
- ☆32 · Updated 3 weeks ago
- The repository for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" · ☆82 · Updated last year
- LeetCode Training and Evaluation Dataset · ☆35 · Updated 5 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral, ACL 2024 SRW · ☆62 · Updated last year
- CodeRAG-Bench: Can Retrieval Augment Code Generation? · ☆156 · Updated 10 months ago
- [NeurIPS 2025 D&B] SWE-bench Goes Live! · ☆123 · Updated last week
- Reproducing R1 for Code with Reliable Rewards · ☆257 · Updated 5 months ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) · ☆157 · Updated last month
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents · ☆166 · Updated 2 months ago
- A Comprehensive Benchmark for Software Development · ☆113 · Updated last year
- ☆12 · Updated 2 months ago
- NaturalCodeBench (Findings of ACL 2024) · ☆67 · Updated 11 months ago
- The official repository of the Omni-MATH benchmark · ☆88 · Updated 9 months ago
- Reproducing R1 for Code with Reliable Rewards · ☆11 · Updated 5 months ago
- Revisiting Mid-training in the Era of Reinforcement Learning Scaling · ☆176 · Updated 2 months ago
- Official repository for the ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning" · ☆171 · Updated 4 months ago
- A comprehensive code-domain benchmark review of LLM research · ☆115 · Updated 2 weeks ago
- ☆127 · Updated last month
- Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows" · ☆111 · Updated last month
- [NeurIPS'24] Official code for *🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* · ☆114 · Updated 9 months ago
- A unified suite for generating elite reasoning problems and training high-performance LLMs, including pioneering attention-free architect… · ☆106 · Updated last week
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning · ☆114 · Updated 5 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation · ☆153 · Updated 11 months ago
- An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories · ☆63 · Updated last year
- ☆53 · Updated last year
- e · ☆41 · Updated 5 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA · ☆138 · Updated 10 months ago
- [LREC-COLING'24] HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization · ☆38 · Updated 7 months ago