scaleapi / SWE-bench_Pro-os
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
☆251Updated 3 weeks ago
Alternatives and similar repositories for SWE-bench_Pro-os
Users who are interested in SWE-bench_Pro-os are comparing it to the libraries listed below.
- Pivotal Token Search☆144Updated last month
- Harbor is a framework for running agent evaluations and creating and using RL environments.☆488Updated this week
- Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera…☆260Updated last week
- GRPO training code which scales to 32xH100s for long horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's T…☆342Updated 5 months ago
- [DAI 2025] Beyond GPT-5: Making LLMs Cheaper and Better via Performance–Efficiency Optimized Routing☆199Updated last month
- Coding problems used in aider's polyglot benchmark☆199Updated last year
- Train your own SOTA deductive reasoning model☆107Updated 10 months ago
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles☆61Updated 8 months ago
- ☆131Updated 8 months ago
- Storing long contexts in tiny caches with self-study☆231Updated last month
- ☆463Updated 2 months ago
- Data recipes and robust infrastructure for training AI agents☆84Updated last week
- Simple & Scalable Pretraining for Neural Architecture Research☆307Updated last month
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆189Updated 10 months ago
- ☆59Updated last year
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more.☆415Updated last week
- Curated collection of community environments☆208Updated this week
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents☆532Updated this week
- a curated list of data for reasoning ai☆141Updated last year
- accompanying material for sleep-time compute paper☆119Updated 9 months ago
- j1-micro (1.7B) & j1-nano (600M) are absurdly tiny but mighty reward models.☆101Updated 6 months ago
- A clean, modular SDK for building AI agents with OpenHands V1.☆459Updated this week
- RepoQA: Evaluating Long-Context Code Understanding☆128Updated last year
- ☆256Updated 10 months ago
- ☆75Updated 7 months ago
- ☆115Updated last year
- Streamline on-policy/off-policy distillation workflows in a few lines of code☆94Updated this week
- ☆131Updated 7 months ago
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model☆262Updated 8 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?"☆69Updated last year