benchflow-ai / benchflow
An AI benchmark runtime framework that lets you integrate and evaluate AI tasks using Docker-based benchmarks.
☆158 · Updated 4 months ago
Alternatives and similar repositories for benchflow
Users interested in benchflow are comparing it to the libraries listed below.
- ⚖️ Awesome LLM Judges ⚖️ ☆128 · Updated 4 months ago
- OSS RL environment + evals toolkit ☆181 · Updated this week
- MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use. ☆154 · Updated last week
- Challenges for general-purpose web-browsing AI agents ☆65 · Updated 3 months ago
- ☆171 · Updated 6 months ago
- Prompt-to-Leaderboard ☆254 · Updated 4 months ago
- Agent-computer interface for AI software engineers. ☆111 · Updated last week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆321 · Updated this week
- Letting Claude Code develop its own MCP tools :) ☆121 · Updated 6 months ago
- AWM: Agent Workflow Memory ☆325 · Updated 7 months ago
- Routing on Random Forest (RoRF) ☆206 · Updated last year
- Verify the precision of all Kimi K2 API vendors ☆139 · Updated this week
- An Agentic Deep Research Assistant similar to Gemini and OpenAI Deep Research ☆123 · Updated 7 months ago
- Commit0: Library Generation from Scratch ☆167 · Updated 4 months ago
- τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment ☆318 · Updated last month
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆407 · Updated this week
- ☆231 · Updated 2 months ago
- Solving data for LLMs - Create quality synthetic datasets! ☆150 · Updated 8 months ago
- Training setup for LangChain's Open Deep Research ☆58 · Updated 3 weeks ago
- Training-Ready RL Environments + Evals ☆111 · Updated this week
- Together Open Deep Research ☆349 · Updated 5 months ago
- Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025) ☆213 · Updated last week
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆136 · Updated 7 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆296 · Updated this week
- Prompt design in Python ☆62 · Updated 10 months ago
- 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/ ☆374 · Updated 2 months ago
- All-in-One Sandbox for AI Agents that combines Browser, Shell, File, MCP and VSCode Server in a single Docker container. ☆141 · Updated this week
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ☆107 · Updated this week
- rl from zero pretrain, can it be done? yes. ☆269 · Updated this week
- Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower exe… ☆253 · Updated 4 months ago