laude-institute / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆322 · Updated this week
Alternatives and similar repositories for terminal-bench
Users interested in terminal-bench are comparing it to the repositories listed below.
- Scaling Data for SWE-agents ☆328 · Updated this week
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆513 · Updated this week
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆136 · Updated 3 weeks ago
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆573 · Updated 4 months ago
- A simple unified framework for evaluating LLMs ☆235 · Updated 3 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆273 · Updated last week
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆197 · Updated 3 weeks ago
- SWE Arena ☆33 · Updated 3 weeks ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆175 · Updated 4 months ago
- Evaluation of LLMs on latest math competitions ☆155 · Updated 2 weeks ago
- r2e: turn any github repository into a programming agent environment ☆129 · Updated 3 months ago
- RepoQA: Evaluating Long-Context Code Understanding ☆113 · Updated 9 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆225 · Updated this week
- Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆233 · Updated 2 months ago
- Open source interpretability artefacts for R1. ☆157 · Updated 3 months ago
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving ☆226 · Updated last week
- [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI ☆409 · Updated 3 months ago
- Reproducing R1 for Code with Reliable Rewards ☆243 · Updated 2 months ago
- ☆99 · Updated last month
- A benchmark that challenges language models to code solutions for scientific problems ☆127 · Updated last week
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆151 · Updated 9 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆232 · Updated 2 months ago
- Commit0: Library Generation from Scratch ☆160 · Updated 2 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆223 · Updated last year
- Reproducible, flexible LLM evaluations ☆226 · Updated 3 weeks ago
- ☆248 · Updated last week
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings ☆47 · Updated 6 months ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆69 · Updated 2 months ago
- PyTorch building blocks for the OLMo ecosystem ☆269 · Updated this week
- SkyRL: A Modular Full-stack RL Library for LLMs ☆679 · Updated this week