symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark 🏆 and framework to compare and evolve the quality of code generation of LLMs.
☆179 · Updated 4 months ago
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the libraries listed below
- Headless IDE for AI agents ☆201 · Updated 4 months ago
- Simple examples using Argilla tools to build AI ☆55 · Updated 9 months ago
- Tutorial for building LLM router ☆226 · Updated last year
- A system that tries to resolve all issues on a github repo with OpenHands. ☆113 · Updated 9 months ago
- ☆104 · Updated 3 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆315 · Updated this week
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆90 · Updated last week
- Coding problems used in aider's polyglot benchmark ☆179 · Updated 8 months ago
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ☆42 · Updated last year
- ☆55 · Updated 7 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆92 · Updated 7 months ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆85 · Updated this week
- Solving data for LLMs - Create quality synthetic datasets! ☆150 · Updated 7 months ago
- ☆171 · Updated 6 months ago
- proof-of-concept of Cursor's Instant Apply feature ☆83 · Updated last year
- Routing on Random Forest (RoRF) ☆203 · Updated 11 months ago
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆80 · Updated last year
- Aider's refactoring benchmark exercises based on popular python repos ☆77 · Updated 11 months ago
- A better way of testing, inspecting, and analyzing AI Agent traces. ☆40 · Updated 2 months ago
- ☆161 · Updated last month
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆89 · Updated 11 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆108 · Updated 5 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 · Updated last year
- ☆165 · Updated 8 months ago
- ☆231 · Updated 2 months ago
- Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" ☆152 · Updated 2 months ago
- Distributed Inference for mlx LLm ☆95 · Updated last year
- An AI agent library using Python as the common language to define executable actions and tool interfaces. ☆84 · Updated last month
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ☆225 · Updated last month
- CursorCore: Assist Programming through Aligning Anything ☆131 · Updated 7 months ago