laude-institute / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆141Updated this week
Alternatives and similar repositories for terminal-bench
Users who are interested in terminal-bench are comparing it to the repositories listed below.
- Scaling Data for SWE-agents☆220Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆173Updated 2 months ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory☆61Updated last week
- Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks☆207Updated 3 weeks ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents☆73Updated last month
- SWE Arena☆33Updated last month
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…☆206Updated 3 weeks ago
- Replicating O1 inference-time scaling laws☆87Updated 6 months ago
- Systematic evaluation framework that automatically rates overthinking behavior in large language models.☆89Updated 2 weeks ago
- Reproducing R1 for Code with Reliable Rewards☆201Updated 3 weeks ago
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates.☆119Updated this week
- ☆114Updated 3 months ago
- SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning☆343Updated last week
- Commit0: Library Generation from Scratch☆149Updated 3 weeks ago
- General Reasoner: Advancing LLM Reasoning Across All Domains☆126Updated this week
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation☆140Updated 7 months ago
- RepoQA: Evaluating Long-Context Code Understanding☆108Updated 7 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆201Updated 3 weeks ago
- A version of verl to support tool use☆172Updated this week
- Async pipelined version of Verl☆91Updated last month
- ☆81Updated 6 months ago
- Evaluation of LLMs on latest math competitions☆129Updated 2 weeks ago
- ☆67Updated 2 months ago
- r2e: turn any GitHub repository into a programming agent environment☆121Updated last month
- Evaluating LLMs with fewer examples☆155Updated last year
- A simple unified framework for evaluating LLMs☆215Updated last month
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]☆477Updated 3 weeks ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning☆169Updated this week
- [NeurIPS 2024] Can LLMs Learn by Teaching for Better Reasoning? A Preliminary Study☆49Updated 6 months ago
- Official Repo for InSTA: Towards Internet-Scale Training For Agents☆42Updated this week