symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.
★180 · Updated 3 months ago
Alternatives and similar repositories for eval-dev-quality
Users who are interested in eval-dev-quality are comparing it to the libraries listed below.
- Tutorial for building LLM router ★224 · Updated last year
- Headless IDE for AI agents ★200 · Updated 4 months ago
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ★112 · Updated 9 months ago
- Coding problems used in aider's polyglot benchmark ★174 · Updated 8 months ago
- Simple examples using Argilla tools to build AI ★53 · Updated 9 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ★286 · Updated last week
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ★218 · Updated 2 weeks ago
- ★229 · Updated last month
- Routing on Random Forest (RoRF) ★195 · Updated 11 months ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ★89 · Updated 2 months ago
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ★42 · Updated last year
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ★80 · Updated last year
- ★98 · Updated 11 months ago
- Beating the GAIA benchmark with Transformers Agents. ★133 · Updated 6 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ★88 · Updated 10 months ago
- Harness used to benchmark aider against SWE Bench benchmarks ★71 · Updated last year
- ★102 · Updated 2 months ago
- A better way of testing, inspecting, and analyzing AI Agent traces. ★40 · Updated last month
- A Text-Based Environment for Interactive Debugging ★256 · Updated last week
- A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching using MLX. ★92 · Updated last month
- ★161 · Updated 8 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ★96 · Updated 4 months ago
- CursorCore: Assist Programming through Aligning Anything ★131 · Updated 6 months ago
- ★161 · Updated 2 weeks ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ★92 · Updated 7 months ago
- Awesome Devin-inspired AI agents ★224 · Updated 6 months ago
- Official repository for "DynaSaur: Large Language Agents Beyond Predefined Actions" ★348 · Updated 8 months ago
- A simple Python sandbox for helpful LLM data agents ★279 · Updated last year
- ★132 · Updated 4 months ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ★146 · Updated 6 months ago