CognitionAI / devin-swebench-results
Cognition's results and methodology on SWE-bench
☆120 Updated last year
Alternatives and similar repositories for devin-swebench-results
Users who are interested in devin-swebench-results are comparing it to the libraries listed below.
- Harness used to benchmark aider against SWE Bench benchmarks ☆76 Updated last year
- Evaluating LLMs with CommonGen-Lite ☆91 Updated last year
- Just a bunch of benchmark logs for different LLMs ☆119 Updated last year
- Mixing Language Models with Self-Verification and Meta-Verification ☆110 Updated 9 months ago
- Run SWE-bench evaluations remotely ☆42 Updated last month
- ☆112 Updated 3 months ago
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆102 Updated last month
- Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task. ☆212 Updated this week
- Pre-training code for CrystalCoder 7B LLM ☆55 Updated last year
- ☆124 Updated last year
- ☆85 Updated 2 years ago
- ☆99 Updated last year
- ☆41 Updated last year
- Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models ☆97 Updated last year
- A set of utilities for running few-shot prompting experiments on large-language models ☆122 Updated last year
- [ICML 2023] "Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation", Wenqing Zheng, S P Sharan, Ajay Kumar Jaiswal, … ☆41 Updated last year
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents ☆128 Updated last year
- Track the progress of LLM context utilisation ☆55 Updated 5 months ago
- ☆159 Updated last year
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆135 Updated 7 months ago
- The data processing pipeline for the Koala chatbot language model ☆118 Updated 2 years ago
- ☆117 Updated 4 months ago
- CodeSage: Code Representation Learning At Scale (ICLR 2024) ☆112 Updated 10 months ago
- Open Implementations of LLM Analyses ☆107 Updated 11 months ago
- Multimodal computer agent data collection program ☆146 Updated last year
- RepoQA: Evaluating Long-Context Code Understanding ☆117 Updated 10 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆62 Updated 9 months ago
- Official repo for NAACL 2024 Findings paper "LeTI: Learning to Generate from Textual Interactions." ☆64 Updated 2 years ago
- WebLINX is a benchmark for building web navigation agents with conversational capabilities ☆158 Updated 7 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆108 Updated 5 months ago