gso-bench / gso
GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
☆27 Updated 3 weeks ago
Alternatives and similar repositories for gso
Users who are interested in gso are comparing it to the libraries listed below.
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆77 Updated 2 weeks ago
- r2e: turn any github repository into a programming agent environment ☆125 Updated 2 months ago
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates. ☆127 Updated this week
- RepoQA: Evaluating Long-Context Code Understanding ☆109 Updated 7 months ago
- ☆88 Updated 3 weeks ago
- ☆26 Updated last week
- Moatless Testbeds allows you to create isolated testbed environments in a Kubernetes cluster where you can apply code changes through git… ☆13 Updated 2 months ago
- Scaling Data for SWE-agents ☆265 Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 Updated 3 months ago
- ☆36 Updated last month
- ☆41 Updated 5 months ago
- A benchmark for LLMs on complicated tasks in the terminal ☆208 Updated this week
- ☆43 Updated this week
- ☆87 Updated 2 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw ☆61 Updated 8 months ago
- Training and Benchmarking LLMs for Code Preference. ☆33 Updated 7 months ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆48 Updated last year
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions ☆23 Updated 10 months ago
- Code for Paper: Learning Adaptive Parallel Reasoning with Language Models ☆107 Updated 2 months ago
- Replicating O1 inference-time scaling laws ☆87 Updated 6 months ago
- Can Language Models Solve Olympiad Programming? ☆117 Updated 5 months ago
- ☆96 Updated 9 months ago
- Async pipelined version of Verl ☆100 Updated 2 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆145 Updated 8 months ago
- ☆17 Updated 5 months ago
- Code and Configs for Asynchronous RLHF: Faster and More Efficient RL for Language Models ☆57 Updated 2 months ago
- The evaluation framework for the InfiCoder-Eval benchmark. ☆20 Updated 11 months ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ☆143 Updated 11 months ago
- ☆48 Updated last month
- Python package for rematerialization-aware gradient checkpointing ☆25 Updated last year