symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
☆182 · Updated 5 months ago
Alternatives and similar repositories for eval-dev-quality
Users who are interested in eval-dev-quality are comparing it to the libraries listed below
- Tutorial for building LLM router ☆231 · Updated last year
- Simple examples using Argilla tools to build AI ☆56 · Updated 11 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆124 · Updated 6 months ago
- Coding problems used in aider's polyglot benchmark ☆184 · Updated 10 months ago
- Just a bunch of benchmark logs for different LLMs ☆118 · Updated last year
- ☆170 · Updated 10 months ago
- 🤖 Headless IDE for AI agents ☆200 · Updated 2 weeks ago
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ☆202 · Updated this week
- A simple Python sandbox for helpful LLM data agents ☆285 · Updated last year
- ☆162 · Updated 2 months ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆90 · Updated last month
- A system that tries to resolve all issues on a github repo with OpenHands. ☆114 · Updated 11 months ago
- Routing on Random Forest (RoRF) ☆214 · Updated last year
- An AI agent library using Python as the common language to define executable actions and tool interfaces. ☆85 · Updated 2 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆90 · Updated 3 weeks ago
- ☆160 · Updated last year
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆349 · Updated this week
- Harness used to benchmark aider against SWE Bench benchmarks ☆76 · Updated last year
- A better way of testing, inspecting, and analyzing AI Agent traces. ☆40 · Updated last week
- CursorCore: Assist Programming through Aligning Anything ☆131 · Updated 8 months ago
- ☆104 · Updated 4 months ago
- Fast parallel LLM inference for MLX ☆224 · Updated last year
- ☆101 · Updated last year
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆83 · Updated last year
- Google Deepmind's PromptBreeder for automated prompt engineering implemented in langchain expression language. ☆151 · Updated last year
- A comprehensive repository of reasoning tasks for LLMs (and beyond) ☆450 · Updated last year
- proof-of-concept of Cursor's Instant Apply feature ☆83 · Updated last year
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆91 · Updated 9 months ago
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆138 · Updated 8 months ago
- CodeSage: Code Representation Learning At Scale (ICLR 2024) ☆113 · Updated last year