symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark 🏆 and framework to compare and evolve the quality of code generation of LLMs.
⭐184 · Updated 8 months ago
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the libraries listed below.
- Simple examples using Argilla tools to build AI ⭐57 · Updated last year
- Tutorial for building LLM router ⭐242 · Updated last year
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ⭐90 · Updated last month
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ⭐90 · Updated 4 months ago
- 🤖 Headless IDE for AI agents ⭐199 · Updated 3 months ago
- ⭐57 · Updated 11 months ago
- Coding problems used in aider's polyglot benchmark ⭐199 · Updated last year
- ⭐236 · Updated last month
- Just a bunch of benchmark logs for different LLMs ⭐119 · Updated last year
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ⭐117 · Updated last year
- Harness used to benchmark aider against SWE Bench benchmarks ⭐78 · Updated last year
- A better way of testing, inspecting, and analyzing AI Agent traces. ⭐40 · Updated this week
- ⭐107 · Updated 2 months ago
- Routing on Random Forest (RoRF) ⭐238 · Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ⭐114 · Updated 9 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ⭐96 · Updated 3 months ago
- ⭐106 · Updated last year
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ⭐44 · Updated last year
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ⭐91 · Updated 11 months ago
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ⭐81 · Updated last year
- A Ruby on Rails style framework for the DSPy (Demonstrate, Search, Predict) project for Language Models like GPT, BERT, and LLama. ⭐132 · Updated last year
- ⭐119 · Updated last year
- An automated tool for discovering insights from research paper corpora ⭐138 · Updated last year
- Public repository containing METR's DVC pipeline for eval data analysis ⭐178 · Updated 9 months ago
- ⭐189 · Updated last year
- A simple Python sandbox for helpful LLM data agents ⭐302 · Updated last year
- ⭐68 · Updated last year
- Ongoing research training transformer models at scale ⭐38 · Updated last year
- ⭐159 · Updated last year
- LangEvals aggregates various language model evaluators into a single platform, providing a standard interface for a multitude of scores a… ⭐70 · Updated last week