symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.
☆185 · Updated 8 months ago
Alternatives and similar repositories for eval-dev-quality
Users who are interested in eval-dev-quality are comparing it to the libraries listed below.
- Tutorial for building LLM router ☆244 · Updated last year
- Simple examples using Argilla tools to build AI ☆57 · Updated last year
- Headless IDE for AI agents ☆200 · Updated 3 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆189 · Updated last week
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ☆44 · Updated last year
- Harness used to benchmark aider against SWE Bench benchmarks ☆79 · Updated last year
- Just a bunch of benchmark logs for different LLMs ☆119 · Updated last year
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆92 · Updated last year
- ☆107 · Updated 3 months ago
- A system that tries to resolve all issues on a github repo with OpenHands. ☆117 · Updated last year
- Coding problems used in aider's polyglot benchmark ☆199 · Updated last year
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆90 · Updated last month
- ☆119 · Updated last year
- Function Calling Benchmark & Testing ☆92 · Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆115 · Updated 9 months ago
- Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player "step-race" that challenges LLM… ☆85 · Updated last month
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆84 · Updated 2 years ago
- ☆177 · Updated 11 months ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆90 · Updated 4 months ago
- Routing on Random Forest (RoRF) ☆239 · Updated last year
- ☆106 · Updated last year
- Finetune Llama-3-8b on the MathInstruct dataset ☆115 · Updated last year
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆99 · Updated 4 months ago
- Aider's refactoring benchmark exercises based on popular python repos ☆78 · Updated last year
- A user interface for DSPy ☆210 · Updated 4 months ago
- ☆166 · Updated 6 months ago
- ☆59 · Updated last year
- Letting Claude Code develop his own MCP tools :) ☆123 · Updated 10 months ago
- Foyle is a copilot to help developers deploy and operate their applications. ☆133 · Updated 10 months ago
- A Text-Based Environment for Interactive Debugging ☆293 · Updated this week