THUDM / SWE-Dev
[ACL'25 Findings] SWE-Dev is an SWE agent with a scalable test-case construction pipeline.
★32 · Updated 3 weeks ago
Alternatives and similar repositories for SWE-Dev
Users interested in SWE-Dev are comparing it to the repositories listed below.
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents · ★73 · Updated last month
- SWE-bench Goes Live! · ★24 · Updated last week
- Reproducing R1 for Code with Reliable Rewards · ★208 · Updated last month
- Moatless Testbeds allows you to create isolated testbed environments in a Kubernetes cluster where you can apply code changes through git… · ★12 · Updated last month
- RepoQA: Evaluating Long-Context Code Understanding · ★108 · Updated 7 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" · ★208 · Updated last month
- ★38 · Updated 5 months ago
- ★46 · Updated last year
- Systematic evaluation framework that automatically rates overthinking behavior in large language models. · ★89 · Updated 3 weeks ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning · ★98 · Updated last month
- Async pipelined version of Verl · ★91 · Updated last month
- ★92 · Updated 3 weeks ago
- ★83 · Updated last month
- Official implementation for the paper "PENCIL: Long Thoughts with Short Memory". · ★45 · Updated 3 weeks ago
- r2e: turn any GitHub repository into a programming agent environment · ★124 · Updated last month
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling · ★104 · Updated 4 months ago
- Official repo for InSTA: Towards Internet-Scale Training For Agents · ★42 · Updated last week
- ★67 · Updated 2 months ago
- [ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples · ★89 · Updated last week
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems · ★91 · Updated 3 months ago
- Code for the paper "Learning Adaptive Parallel Reasoning with Language Models" · ★96 · Updated last month
- ★49 · Updated 3 weeks ago
- DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? · ★54 · Updated 3 months ago
- A benchmark for LLMs on complicated tasks in the terminal · ★141 · Updated this week
- ★102 · Updated 6 months ago
- Official code for the paper "CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules" · ★45 · Updated 4 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation · ★141 · Updated 7 months ago
- ★61 · Updated 7 months ago
- Training and Benchmarking LLMs for Code Preference. · ★33 · Updated 6 months ago
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans. · ★81 · Updated last month