aorwall / moatless-testbeds
Moatless Testbeds allows you to create isolated testbed environments in a Kubernetes cluster where you can apply code changes through git patches and run tests or SWE-Bench evaluations.
☆11 Updated 3 weeks ago
Alternatives and similar repositories for moatless-testbeds:
Users who are interested in moatless-testbeds compare it to the repositories listed below.
- Small, simple agent task environments for training and evaluation ☆18 Updated 6 months ago
- Training and Benchmarking LLMs for Code Preference. ☆33 Updated 5 months ago
- ☆21 Updated last year
- ☆79 Updated 2 weeks ago
- Aioli: A unified optimization framework for language model data mixing ☆25 Updated 3 months ago
- ☆24 Updated 6 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆55 Updated 8 months ago
- NeurIPS 2024 tutorial on LLM Inference ☆43 Updated 4 months ago
- Reasoning by Communicating with Agents ☆27 Updated last week
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆47 Updated last year
- ☆42 Updated last month
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆32 Updated 2 weeks ago
- Simple and efficient pytorch-native transformer training and inference (batched) ☆73 Updated last year
- Script for processing OpenAI's PRM800K process supervision dataset into an Alpaca-style instruction-response format ☆27 Updated last year
- Implementation of "SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models" ☆27 Updated 2 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆64 Updated 2 weeks ago
- ☆15 Updated 3 weeks ago
- CodeUltraFeedback: aligning large language models to coding preferences ☆71 Updated 10 months ago
- ☆85 Updated last week
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆101 Updated 2 months ago
- ☆42 Updated 7 months ago
- 🔔🧠 Easily experiment with popular language agents across diverse reasoning/decision-making benchmarks! ☆51 Updated last month
- RepoQA: Evaluating Long-Context Code Understanding ☆108 Updated 6 months ago
- ☆60 Updated last year
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models ☆41 Updated 10 months ago
- Data preparation code for CrystalCoder 7B LLM ☆44 Updated 11 months ago
- Minimum Description Length probing for neural network representations ☆19 Updated 3 months ago
- Codebase for Instruction Following without Instruction Tuning ☆34 Updated 7 months ago
- ☆44 Updated 2 months ago
- [NAACL'25] "Revealing the Barriers of Language Agents in Planning" ☆12 Updated 6 months ago