symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.
☆182 · Updated 7 months ago
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the libraries listed below.
- Tutorial for building LLM router ☆239 · Updated last year
- Simple examples using Argilla tools to build AI ☆57 · Updated last year
- Headless IDE for AI agents ☆200 · Updated 2 months ago
- ☆189 · Updated last year
- An AI agent library using Python as the common language to define executable actions and tool interfaces. ☆115 · Updated last month
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ☆117 · Updated last year
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆83 · Updated last year
- ☆235 · Updated last month
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆91 · Updated 3 months ago
- Coding problems used in aider's polyglot benchmark ☆199 · Updated last year
- CursorCore: Assist Programming through Aligning Anything ☆133 · Updated 10 months ago
- ☆107 · Updated last month
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆89 · Updated 2 weeks ago
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ☆43 · Updated last year
- Routing on Random Forest (RoRF) ☆235 · Updated last year
- A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching using MLX. ☆99 · Updated 5 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆91 · Updated 11 months ago
- Experimental code for StructuredRAG: JSON Response Formatting with Large Language Models ☆115 · Updated 8 months ago
- Building an open version of OpenAI o1 via reasoning traces (Groq, ollama, Anthropic, Gemini, OpenAI, Azure supported) Demo: https://hugging… ☆187 · Updated last year
- Code for evaluating with Flow-Judge-v0.1, an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ☆78 · Updated last year
- A simple Python sandbox for helpful LLM data agents ☆299 · Updated last year
- Public repository containing METR's DVC pipeline for eval data analysis ☆164 · Updated 8 months ago
- Proof of concept of Cursor's Instant Apply feature ☆87 · Updated last year
- ☆59 · Updated 11 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 · Updated last year
- A user interface for DSPy ☆204 · Updated 2 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆396 · Updated this week
- A better way of testing, inspecting, and analyzing AI agent traces. ☆40 · Updated 2 months ago
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ☆240 · Updated 4 months ago
- Google DeepMind's PromptBreeder for automated prompt engineering, implemented in LangChain Expression Language. ☆161 · Updated last year