benchflow-ai / benchflow
AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks.
☆154 Updated 2 months ago
Alternatives and similar repositories for benchflow
Users who are interested in benchflow compare it to the libraries listed below.
- ⚖️ Awesome LLM Judges ⚖️ ☆107 Updated 2 months ago
- Coding problems used in aider's polyglot benchmark ☆155 Updated 6 months ago
- Prompt-to-Leaderboard ☆241 Updated 2 months ago
- Agent computer interface for AI software engineer. ☆89 Updated this week
- Challenges for general-purpose web-browsing AI agents ☆60 Updated last month
- ☆162 Updated 4 months ago
- Scaling Data for SWE-agents ☆293 Updated this week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆247 Updated this week
- Letting Claude Code develop his own MCP tools :) ☆114 Updated 4 months ago
- Commit0: Library Generation from Scratch ☆158 Updated 2 months ago
- Routing on Random Forest (RoRF) ☆176 Updated 9 months ago
- Solving data for LLMs - Create quality synthetic datasets! ☆150 Updated 5 months ago
- ☆213 Updated 2 weeks ago
- Together Open Deep Research ☆320 Updated 3 months ago
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning? ☆71 Updated 3 months ago
- Inference-time scaling for LLMs-as-a-judge. ☆251 Updated this week
- j1-micro (1.7B) & j1-nano (600M) are absurdly tiny but mighty reward models. ☆91 Updated last month
- An agent benchmark with tasks in a simulated software company. ☆488 Updated last week
- ☆64 Updated last month
- ☆129 Updated 3 months ago
- AWM: Agent Workflow Memory ☆291 Updated 5 months ago
- ☆94 Updated last month
- Claude Deep Research config for Claude Code. ☆196 Updated 4 months ago
- ☆306 Updated 3 weeks ago
- Prompt engineering, automated. ☆331 Updated 2 months ago
- The official Python library for Arklex framework ☆253 Updated this week
- proof-of-concept of Cursor's Instant Apply feature ☆83 Updated 10 months ago
- HUD SDK ☆71 Updated this week
- ☆226 Updated 9 months ago
- A system that tries to resolve all issues on a github repo with OpenHands. ☆110 Updated 7 months ago