symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the code-generation quality of LLMs.
☆176 · Updated last month
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the repositories listed below.
- 🤖 Headless IDE for AI agents ☆191 · Updated 2 months ago
- Simple examples using Argilla tools to build AI ☆53 · Updated 7 months ago
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆78 · Updated last year
- ☆96 · Updated last week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆228 · Updated this week
- A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching, using MLX. ☆85 · Updated last week
- ☆158 · Updated 9 months ago
- Train Large Language Models on MLX. ☆94 · Updated this week
- Tutorial for building an LLM router ☆210 · Updated 11 months ago
- Code for the ScribeAgent paper ☆58 · Updated 3 months ago
- Agent-computer interface for AI software engineers. ☆85 · Updated this week
- Letting Claude Code develop its own MCP tools :) ☆113 · Updated 3 months ago
- Scaling Data for SWE-agents ☆256 · Updated this week
- Harness used to benchmark aider against SWE Bench benchmarks ☆72 · Updated 11 months ago
- A simple Python sandbox for helpful LLM data agents ☆267 · Updated last year
- A user interface for DSPy ☆160 · Updated last month
- Proof-of-concept of Cursor's Instant Apply feature ☆82 · Updated 9 months ago
- ☆130 · Updated last month
- Run AI-generated code in isolated sandboxes ☆83 · Updated 4 months ago
- ☆156 · Updated 3 months ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆80 · Updated 3 months ago
- 😎 Awesome list of resources about using and building AI software development systems ☆110 · Updated last year
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆87 · Updated 2 weeks ago
- ⚖️ Awesome LLM Judges ⚖️ ☆105 · Updated last month
- Automated fine-tuning of models with synthetic data ☆75 · Updated last year
- Aider's refactoring benchmark exercises based on popular Python repos ☆74 · Updated 8 months ago
- Train your own SOTA deductive reasoning model ☆94 · Updated 3 months ago
- An automated tool for discovering insights from research paper corpora ☆138 · Updated last year
- Finetune Llama-3-8b on the MathInstruct dataset ☆110 · Updated 8 months ago
- Guardrails for secure and robust agent development ☆305 · Updated 2 weeks ago