symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs.
☆182 · Updated 4 months ago
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the libraries listed below.
- Tutorial for building LLM router ☆228 · Updated last year
- Simple examples using Argilla tools to build AI ☆56 · Updated 10 months ago
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ☆83 · Updated last year
- ☆100 · Updated last year
- Public repository containing METR's DVC pipeline for eval data analysis ☆117 · Updated 6 months ago
- Routing on Random Forest (RoRF) ☆211 · Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆111 · Updated 5 months ago
- ☆104 · Updated 3 months ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆86 · Updated 3 weeks ago
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ☆113 · Updated 10 months ago
- ☆166 · Updated 9 months ago
- CursorCore: Assist Programming through Aligning Anything ☆131 · Updated 7 months ago
- Coding problems used in aider's polyglot benchmark ☆182 · Updated 9 months ago
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ☆42 · Updated last year
- Google DeepMind's PromptBreeder for automated prompt engineering implemented in LangChain Expression Language ☆147 · Updated last year
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ☆228 · Updated 2 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆91 · Updated 8 months ago
- 🤖 Headless IDE for AI agents ☆202 · Updated 5 months ago
- Finetune Llama-3-8b on the MathInstruct dataset ☆111 · Updated 11 months ago
- ☆170 · Updated 7 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆327 · Updated this week
- ☆232 · Updated 3 months ago
- GPT-4 Level Conversational QA Trained In a Few Hours ☆65 · Updated last year
- Awesome Devin-inspired AI agents ☆226 · Updated 7 months ago
- Proof-of-concept of Cursor's Instant Apply feature ☆83 · Updated last year
- A Python library to orchestrate LLMs in a neural network-inspired structure ☆50 · Updated last year
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆90 · Updated 3 weeks ago
- Letting Claude Code develop its own MCP tools :) ☆122 · Updated 6 months ago
- A toolkit for building computer use AI agents ☆177 · Updated 3 months ago
- ☆133 · Updated 5 months ago