OpenDevin / OD-SWE-bench
Enhanced fork of SWE-bench, tailored for OpenDevin's ecosystem.
☆20, updated 5 months ago
Related projects
Alternatives and complementary repositories for OD-SWE-bench
- Harness used to benchmark aider against the SWE-bench benchmarks (☆52, updated 4 months ago)
- Aider's refactoring benchmark exercises, based on popular Python repos (☆44, updated last month)
- Contains the model patches and the eval logs from the passing SWE-bench Lite run (☆10, updated 4 months ago)
- Enhancing AI Software Engineering with Repository-level Code Graph (☆92, updated 2 months ago)
- LangChain + LiteLLM that works (☆24, updated last week)
- (☆31, updated 2 weeks ago)
- Resources for our paper: "EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms" (☆75, updated 3 weeks ago)
- Data preparation code for the CrystalCoder 7B LLM (☆42, updated 6 months ago)
- (☆80, updated 3 months ago)
- (☆152, updated 2 months ago)
- Unleash the full potential of exascale LLMs on consumer-class GPUs, proven by extensive benchmarks, with no long-term adjustments and min… (☆23, updated last week)
- RepoQA: Evaluating Long-Context Code Understanding (☆99, updated last week)
- A Qt GUI for large language models (☆24, updated 10 months ago)
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor (☆61, updated 9 months ago)
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL (☆166, updated this week)
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" (☆206, updated last month)
- Open Agent Computer Interface (☆38, updated last month)
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents (☆109, updated 4 months ago)
- Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement (☆45, updated 2 weeks ago)
- Small and Efficient Mathematical Reasoning LLMs (☆71, updated 9 months ago)
- (☆76, updated 10 months ago)
- (☆27, updated 4 months ago)
- Evaluating tool-augmented LLMs in conversation settings (☆72, updated 5 months ago)
- Advancing LLMs with Diverse Coding Capabilities (☆51, updated 3 months ago)
- Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task (☆99, updated this week)
- WebLINX is a benchmark for building web navigation agents with conversational capabilities (☆116, updated last month)
- (☆41, updated 2 months ago)
- Data and evaluation scripts for "CodePlan: Repository-level Coding using LLMs and Planning", FSE 2024 (☆51, updated 2 months ago)
- Beating the GAIA benchmark with Transformers Agents. 🚀 (☆62, updated 2 weeks ago)