laude-institute / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆177 · Updated this week
Alternatives and similar repositories for terminal-bench
Users interested in terminal-bench are comparing it to the repositories listed below.
- Scaling Data for SWE-agents ☆256 · Updated this week
- RepoQA: Evaluating Long-Context Code Understanding ☆109 · Updated 7 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 · Updated 3 months ago
- SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning ☆422 · Updated this week
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆219 · Updated last month
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆76 · Updated 2 weeks ago
- Reproducing R1 for Code with Reliable Rewards ☆221 · Updated last month
- Code for the paper "Training Software Engineering Agents and Verifiers with SWE-Gym" [ICML 2025] ☆486 · Updated last month
- Async pipelined version of Verl ☆100 · Updated 2 months ago
- ☆180 · Updated 2 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆205 · Updated 2 weeks ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆62 · Updated last month
- SWE Arena ☆34 · Updated 2 months ago
- Replicating O1 inference-time scaling laws ☆87 · Updated 6 months ago
- ☆127 · Updated 3 months ago
- ☆115 · Updated 4 months ago
- Reproducible, flexible LLM evaluations ☆213 · Updated last month
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆145 · Updated 8 months ago
- 🚀 SWE-bench Goes Live! ☆80 · Updated this week
- Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task. ☆183 · Updated this week
- ☆96 · Updated 8 months ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆184 · Updated this week
- A simple unified framework for evaluating LLMs ☆219 · Updated 2 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper ☆215 · Updated last month
- ☆65 · Updated last year
- The HELMET Benchmark ☆154 · Updated 2 months ago
- GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents ☆25 · Updated 3 weeks ago
- Evaluation of LLMs on the latest math competitions ☆136 · Updated last month
- Can Language Models Solve Olympiad Programming? ☆116 · Updated 5 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆173 · Updated 5 months ago