CognitionAI / devin-swebench-results
Cognition's results and methodology on SWE-bench
☆118 · Updated 8 months ago
Related projects
Alternatives and complementary repositories for devin-swebench-results
- Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task. ☆103 · Updated this week
- Harness used to benchmark Aider against the SWE-bench benchmark ☆53 · Updated 4 months ago
- ☆103 · Updated 3 months ago
- ☆153 · Updated 2 months ago
- Aider's refactoring benchmark exercises, based on popular Python repos ☆45 · Updated last month
- Just a bunch of benchmark logs for different LLMs ☆116 · Updated 3 months ago
- r2e: turn any GitHub repository into a programming agent environment ☆89 · Updated 3 weeks ago
- Enhancing AI Software Engineering with Repository-level Code Graph ☆96 · Updated 2 months ago
- Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions ☆40 · Updated 3 months ago
- Track the progress of LLM context utilisation ☆53 · Updated 4 months ago
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆63 · Updated 3 weeks ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆97 · Updated last year
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆47 · Updated last month
- Evaluating tool-augmented LLMs in conversation settings ☆72 · Updated 5 months ago
- Accepted by Transactions on Machine Learning Research (TMLR) ☆120 · Updated last month
- Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement ☆47 · Updated 3 weeks ago
- ☆82 · Updated 4 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆63 · Updated last month
- A new benchmark for measuring LLMs' capability to detect bugs in large codebases. ☆27 · Updated 5 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆140 · Updated 3 months ago
- Evaluating LLMs with CommonGen-Lite ☆85 · Updated 8 months ago
- Public Inflection Benchmarks ☆69 · Updated 8 months ago
- [NeurIPS 2023 D&B] Code repository for the InterCode benchmark (https://arxiv.org/abs/2306.14898) ☆194 · Updated 6 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆222 · Updated last month
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral at the ACL 2024 SRW ☆52 · Updated last month
- Client Code Examples, Use Cases and Benchmarks for the Enterprise h2oGPTe RAG-Based GenAI Platform ☆81 · Updated this week
- A set of utilities for running few-shot prompting experiments on large language models ☆113 · Updated last year
- ☆66 · Updated 2 months ago
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents ☆111 · Updated 5 months ago
- Can Language Models Solve Olympiad Programming? ☆101 · Updated 3 months ago