benchflow-ai / benchflowLinks

AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks.

☆154

Alternatives and similar repositories for benchflow

Users that are interested in benchflow are comparing it to the libraries listed below

Sorting:

haizelabs / Awesome-LLM-Judges
⚖️ Awesome LLM Judges ⚖️
☆107Updated 2 months ago
Aider-AI / polyglot-benchmark
Coding problems used in aider's polyglot benchmark
☆155Updated 6 months ago
lmarena / p2l
Prompt-to-Leaderboard
☆241Updated 2 months ago
All-Hands-AI / openhands-aci
Agent computer interface for AI software engineer.
☆89Updated this week
convergence-ai / webgames
Challenges for general-purpose web-browsing AI agents
☆60Updated last month
agora-protocol / paper-demo
☆162Updated 4 months ago
SWE-bench / SWE-smith
Scaling Data for SWE-agents
☆293Updated this week
SWE-agent / SWE-ReX
Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more.
☆247Updated this week
willccbb / claude-code-mcp
Letting Claude Code develop his own MCP tools :)
☆114Updated 4 months ago
commit-0 / commit0
Commit0: Library Generation from Scratch
☆158Updated 2 months ago
Not-Diamond / RoRF
Routing on Random Forest (RoRF)
☆176Updated 9 months ago
BhabhaAI / dataformer
Solving data for LLMs - Create quality synthetic datasets!
☆150Updated 5 months ago
google / lmeval
☆213Updated 2 weeks ago
togethercomputer / open_deep_research
Together Open Deep Research
☆320Updated 3 months ago
google-deepmind / latent-multi-hop-reasoning
[ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning?
☆71Updated 3 months ago
haizelabs / verdict
Inference-time scaling for LLMs-as-a-judge.
☆251Updated this week
haizelabs / j1-micro
j1-micro (1.7B) & j1-nano (600M) are absurdly tiny but mighty reward models.
☆91Updated last month
TheAgentCompany / TheAgentCompany
An agent benchmark with tasks in a simulated software company.
☆488Updated last week
brendanhogan / picoDeepResearch
☆64Updated last month
PrimeIntellect-ai / genesys
☆129Updated 3 months ago
zorazrw / agent-workflow-memory
AWM: Agent Workflow Memory
☆291Updated 5 months ago
aorwall / moatless-tree-search
☆94Updated last month
willccbb / claude-deep-research
Claude Deep Research config for Claude Code.
☆196Updated 4 months ago
anthropic-experimental / agentic-misalignment
☆306Updated 3 weeks ago
zenbase-ai / core
Prompt engineering, automated.
☆331Updated 2 months ago
arklexai / Agent-First-Organization
The official Python library for Arklex framework
☆253Updated this week
llllvvuu / instant_apply
proof-of-concept of Cursor's Instant Apply feature
☆83Updated 10 months ago
hud-evals / hud-sdk
HUD SDK
☆71Updated this week
SouthBridgeAI / diagen
☆226Updated 9 months ago
All-Hands-AI / openhands-resolver
A system that tries to resolve all issues on a github repo with OpenHands.
☆110Updated 7 months ago