symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.
☆185 · Updated 8 months ago
Alternatives and similar repositories for eval-dev-quality
Users who are interested in eval-dev-quality are comparing it to the repositories listed below.
- Simple examples using Argilla tools to build AI ☆57 · Updated last year
- Public repository containing METR's DVC pipeline for eval data analysis ☆199 · Updated last week
- Tutorial for building LLM router ☆244 · Updated last year
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ☆117 · Updated last year
- Headless IDE for AI agents ☆200 · Updated 3 months ago
- Harness used to benchmark aider against SWE Bench benchmarks ☆79 · Updated last year
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆90 · Updated last month
- ☆177 · Updated 11 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 · Updated last year
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆92 · Updated last year
- Foyle is a copilot to help developers deploy and operate their applications. ☆133 · Updated 10 months ago
- ☆107 · Updated 3 months ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆90 · Updated 4 months ago
- ☆237 · Updated 2 months ago
- ☆106 · Updated last year
- Synthetic Data for LLM Fine-Tuning ☆120 · Updated 2 years ago
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ☆44 · Updated last year
- ☆159 · Updated last year
- ☆80 · Updated 4 months ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ☆150 · Updated last month
- ☆119 · Updated last year
- Routing on Random Forest (RoRF) ☆239 · Updated last year
- ☆59 · Updated last year
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆84 · Updated 2 years ago
- Coding problems used in aider's polyglot benchmark ☆199 · Updated last year
- ☆166 · Updated 6 months ago
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆115 · Updated 9 months ago
- Train your own SOTA deductive reasoning model ☆107 · Updated 11 months ago
- ☆190 · Updated last year
- Official repository for "DynaSaur: Large Language Agents Beyond Predefined Actions" ☆355 · Updated last year