symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
☆178 Updated 2 months ago
Alternatives and similar repositories for eval-dev-quality
Users who are interested in eval-dev-quality are comparing it to the repositories listed below.
- 🤖 Headless IDE for AI agents ☆192 Updated 2 months ago
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ☆110 Updated 7 months ago
- ☆96 Updated 10 months ago
- Simple examples using Argilla tools to build AI ☆53 Updated 7 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆247 Updated this week
- Tutorial for building an LLM router ☆216 Updated 11 months ago
- ☆213 Updated 2 weeks ago
- LangEvals aggregates various language model evaluators into a single platform, providing a standard interface for a multitude of scores a… ☆60 Updated last week
- Awesome Devin-inspired AI agents ☆221 Updated 4 months ago
- ☆158 Updated 10 months ago
- CursorCore: Assist Programming through Aligning Anything ☆127 Updated 5 months ago
- Routing on Random Forest (RoRF) ☆176 Updated 9 months ago
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆79 Updated last year
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆91 Updated 5 months ago
- Aider's refactoring benchmark exercises based on popular Python repos ☆75 Updated 9 months ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆87 Updated 3 weeks ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆82 Updated 4 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆84 Updated 9 months ago
- Agent computer interface for AI software engineer. ☆89 Updated this week
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles ☆52 Updated 2 months ago
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ☆183 Updated last week
- A simple Python sandbox for helpful LLM data agents ☆272 Updated last year
- Solving data for LLMs - Create quality synthetic datasets! ☆150 Updated 5 months ago
- Harness used to benchmark aider against SWE Bench benchmarks ☆72 Updated last year
- ☆101 Updated last month
- A user interface for DSPy ☆162 Updated last month
- Function Calling Benchmark & Testing ☆87 Updated last year
- ☆162 Updated 4 months ago
- GraphRAG database - hybrid graph / vector db ☆127 Updated 9 months ago
- Google DeepMind's PromptBreeder for automated prompt engineering implemented in LangChain Expression Language. ☆125 Updated 11 months ago