symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.
★182 · Updated 6 months ago
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the libraries listed below.
- Simple examples using Argilla tools to build AI ★56 · Updated last year
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ★117 · Updated last year
- Tutorial for building an LLM router ★236 · Updated last year
- Public repository containing METR's DVC pipeline for eval data analysis ★140 · Updated 8 months ago
- Routing on Random Forest (RoRF) ★226 · Updated last year
- Coding problems used in aider's polyglot benchmark ★194 · Updated 11 months ago
- ★179 · Updated 11 months ago
- ★164 · Updated 4 months ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ★89 · Updated this week
- Headless IDE for AI agents ★200 · Updated last month
- ★107 · Updated last month
- ★117 · Updated 11 months ago
- Harness used to benchmark aider against SWE-bench benchmarks ★78 · Updated last year
- Client code examples, use cases, and benchmarks for the Enterprise h2oGPTe RAG-based GenAI platform ★91 · Updated 2 months ago
- A DSPy-based implementation of the tree-of-thoughts method (Yao et al., 2023) for generating persuasive arguments ★93 · Updated 2 months ago
- Just a bunch of benchmark logs for different LLMs ★119 · Updated last year
- Proof-of-concept of Cursor's Instant Apply feature ★87 · Updated last year
- ★173 · Updated 9 months ago
- Hallucinations (confabulations) document-based benchmark for RAG. Includes human-verified questions and answers. ★238 · Updated 4 months ago
- Function-calling benchmark and testing ★92 · Updated last year
- A comprehensive repository of reasoning tasks for LLMs (and beyond) ★452 · Updated last year
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ★91 · Updated 10 months ago
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ★43 · Updated last year
- Google DeepMind's PromptBreeder for automated prompt engineering, implemented in LangChain Expression Language ★159 · Updated last year
- A simple Python sandbox for helpful LLM data agents ★297 · Updated last year
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ★83 · Updated last year
- GPT-4-level conversational QA trained in a few hours ★66 · Updated last year
- Experimental code for StructuredRAG: JSON response formatting with large language models ★115 · Updated 7 months ago
- A user interface for DSPy ★200 · Updated 2 months ago
- ★102 · Updated last year