laude-institute / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆240 · Updated this week
Alternatives and similar repositories for terminal-bench
Users interested in terminal-bench are comparing it to the repositories listed below.
- Scaling Data for SWE-agents · ☆293 · Updated this week
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] · ☆498 · Updated 2 months ago
- r2e: turn any GitHub repository into a programming agent environment · ☆126 · Updated 2 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents · ☆112 · Updated last week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. · ☆173 · Updated 4 months ago
- RepoQA: Evaluating Long-Context Code Understanding · ☆109 · Updated 8 months ago
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. · ☆191 · Updated this week
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" · ☆223 · Updated 2 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning · ☆207 · Updated this week
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Pap… · ☆221 · Updated 2 months ago
- A simple unified framework for evaluating LLMs · ☆221 · Updated 3 months ago
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" · ☆566 · Updated 3 months ago
- SWE Arena · ☆34 · Updated last week
- Async pipelined version of Verl · ☆106 · Updated 3 months ago
- A benchmark that challenges language models to code solutions for scientific problems · ☆127 · Updated last week
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings · ☆44 · Updated 5 months ago
- Evaluation of LLMs on latest math competitions · ☆140 · Updated 2 months ago
- 🚀 SWE-bench Goes Live! · ☆91 · Updated this week
- SkyRL: A Modular Full-stack RL Library for LLMs · ☆574 · Updated this week
- Reproducible, flexible LLM evaluations · ☆219 · Updated this week
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 · ☆221 · Updated last year
- ☆234 · Updated 11 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation · ☆148 · Updated 9 months ago
- Reproducing R1 for Code with Reliable Rewards · ☆232 · Updated 2 months ago
- Open source interpretability artefacts for R1. · ☆154 · Updated 2 months ago
- ☆41 · Updated 5 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. · ☆242 · Updated last week
- ☆104 · Updated 2 months ago
- [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI · ☆395 · Updated 3 months ago
- ☆93 · Updated last month