symflower / eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
⭐182 · Updated 6 months ago
Alternatives and similar repositories for eval-dev-quality
Users interested in eval-dev-quality are comparing it to the libraries listed below.
- Tutorial for building LLM router ⭐235 · Updated last year
- Coding problems used in aider's polyglot benchmark ⭐190 · Updated 10 months ago
- A system that tries to resolve all issues on a GitHub repo with OpenHands. ⭐115 · Updated 11 months ago
- Simple examples using Argilla tools to build AI ⭐56 · Updated 11 months ago
- ⭐171 · Updated 10 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ⭐358 · Updated last week
- 🤖 Headless IDE for AI agents ⭐201 · Updated last month
- CursorCore: Assist Programming through Aligning Anything ⭐132 · Updated 9 months ago
- ⭐107 · Updated 2 weeks ago
- ⭐59 · Updated 9 months ago
- Routing on Random Forest (RoRF) ⭐220 · Updated last year
- Public repository containing METR's DVC pipeline for eval data analysis ⭐129 · Updated 7 months ago
- ⭐172 · Updated 8 months ago
- Beating the GAIA benchmark with Transformers Agents. 🚀 ⭐138 · Updated 8 months ago
- Harness used to benchmark aider against SWE Bench benchmarks ⭐77 · Updated last year
- Conduct in-depth research with AI-driven insights: DeepDive is a command-line tool that leverages web searches and AI models to generate… ⭐42 · Updated last year
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ⭐91 · Updated 2 months ago
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ⭐83 · Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ⭐111 · Updated 7 months ago
- Building open version of OpenAI o1 via reasoning traces (Groq, ollama, Anthropic, Gemini, OpenAI, Azure supported) Demo: https://hugging… ⭐184 · Updated last year
- Google DeepMind's PromptBreeder for automated prompt engineering implemented in LangChain expression language. ⭐154 · Updated last year
- Train your own SOTA deductive reasoning model ⭐108 · Updated 8 months ago
- Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) ⭐80 · Updated 9 months ago
- ⚖️ Awesome LLM Judges ⚖️ ⭐133 · Updated 6 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ⭐91 · Updated 9 months ago
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ⭐231 · Updated 3 months ago
- A simple Python sandbox for helpful LLM data agents ⭐292 · Updated last year
- Just a bunch of benchmark logs for different LLMs ⭐118 · Updated last year
- ⭐101 · Updated last year
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ⭐212 · Updated last week