benchflow-ai / benchflow
AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks.
☆155 Updated last month
Alternatives and similar repositories for benchflow
Users who are interested in benchflow are comparing it to the repositories listed below.
- ⚖️ Awesome LLM Judges ⚖️ ☆105 Updated last month
- Challenges for general-purpose web-browsing AI agents ☆58 Updated 3 weeks ago
- A Deep Research agent from scratch ☆186 Updated last month
- Prompt engineering, automated. ☆329 Updated 2 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆486 Updated last month
- Scale your LLM-as-a-judge. ☆240 Updated 2 weeks ago
- A lightweight framework for building research agents designed for developers ☆95 Updated this week
- Letting Claude Code develop his own MCP tools :) ☆113 Updated 3 months ago
- Scaling Data for SWE-agents ☆256 Updated this week
- Prompt design in Python ☆60 Updated 6 months ago
- AWM: Agent Workflow Memory ☆275 Updated 4 months ago
- List of Open Source projects built on Browser Use ☆76 Updated last month
- A benchmark for LLMs on complicated tasks in the terminal ☆177 Updated this week
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning? ☆68 Updated 3 months ago
- AGI SDK ☆60 Updated last week
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆551 Updated 3 months ago
- Prompt-to-Leaderboard ☆239 Updated last month
- Solving data for LLMs - Create quality synthetic datasets! ☆149 Updated 5 months ago
- CodeScientist: An automated scientific discovery system for code-based experiments ☆271 Updated 2 months ago
- Together Open Deep Research ☆309 Updated 2 months ago
- Agent computer interface for AI software engineer. ☆85 Updated this week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆228 Updated this week
- An implementation of a computer use agent (CUA) using LangGraph ☆158 Updated 2 months ago
- Claude Deep Research config for Claude Code. ☆187 Updated 3 months ago
- prime-rl is a codebase for decentralized async RL training at scale ☆341 Updated this week
- Train your own SOTA deductive reasoning model ☆94 Updated 3 months ago
- II-Researcher: a new open-source framework designed to aid building search / research agents ☆376 Updated last month
- HUD SDK ☆64 Updated this week
- Coding problems used in aider's polyglot benchmark ☆141 Updated 6 months ago
- An MCP Server that's also an MCP Client. Useful for letting Claude develop and test MCPs without needing to reset the application. ☆120 Updated 3 months ago