symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
⭐182 · Updated 6 months ago
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the libraries listed below.
- Tutorial for building LLM router ⭐235 · Updated last year
- Coding problems used in aider's polyglot benchmark ⭐190 · Updated 10 months ago
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ⭐115 · Updated 11 months ago
- Simple examples using Argilla tools to build AI ⭐56 · Updated 11 months ago
- ⭐171 · Updated 10 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ⭐358 · Updated last week
- 🤖 Headless IDE for AI agents ⭐201 · Updated last month
- CursorCore: Assist Programming through Aligning Anything ⭐132 · Updated 9 months ago
- ⭐107 · Updated 2 weeks ago
- ⭐59 · Updated 9 months ago
- Routing on Random Forest (RoRF) ⭐220 · Updated last year
- Public repository containing METR's DVC pipeline for eval data analysis ⭐129 · Updated 7 months ago
- ⭐172 · Updated 8 months ago
- Beating the GAIA benchmark with Transformers Agents. 🚀 ⭐138 · Updated 8 months ago
- Harness used to benchmark aider against SWE Bench benchmarks ⭐77 · Updated last year
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ⭐42 · Updated last year
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ⭐91 · Updated 2 months ago
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ⭐83 · Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ⭐111 · Updated 7 months ago
- Building open version of OpenAI o1 via reasoning traces (Groq, ollama, Anthropic, Gemini, OpenAI, Azure supported) Demo: https://hugging… ⭐184 · Updated last year
- Google DeepMind's PromptBreeder for automated prompt engineering implemented in LangChain expression language. ⭐154 · Updated last year
- Train your own SOTA deductive reasoning model ⭐108 · Updated 8 months ago
- Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) ⭐80 · Updated 9 months ago
- ⚖️ Awesome LLM Judges ⚖️ ⭐133 · Updated 6 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ⭐91 · Updated 9 months ago
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ⭐231 · Updated 3 months ago
- A simple Python sandbox for helpful LLM data agents ⭐292 · Updated last year
- Just a bunch of benchmark logs for different LLMs ⭐118 · Updated last year
- ⭐101 · Updated last year
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ⭐212 · Updated last week