benchflow-ai / benchflow
AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks.
☆158 · Updated 5 months ago
Alternatives and similar repositories for benchflow
Users interested in benchflow are comparing it to the libraries listed below.
- OSS RL environment + evals toolkit ☆189 · Updated this week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆331 · Updated 2 weeks ago
- ⚖️ Awesome LLM Judges ⚖️ ☆131 · Updated 5 months ago
- ☆170 · Updated 7 months ago
- Verify Precision of all Kimi K2 API Vendor ☆258 · Updated last week
- Commit0: Library Generation from Scratch ☆169 · Updated 5 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆429 · Updated last week
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆138 · Updated 8 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆298 · Updated this week
- Challenges for general-purpose web-browsing AI agents ☆64 · Updated 4 months ago
- Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower exe… ☆258 · Updated 5 months ago
- Coding problems used in aider's polyglot benchmark ☆183 · Updated 9 months ago
- Together Open Deep Research ☆351 · Updated 6 months ago
- Solving data for LLMs - Create quality synthetic datasets! ☆151 · Updated 9 months ago
- AWM: Agent Workflow Memory ☆332 · Updated 8 months ago
- Agent computer interface for AI software engineer. ☆111 · Updated last month
- ☆246 · Updated last year
- Real-Time Detection of Hallucinated Entities in Long-Form Generation ☆258 · Updated last month
- Letting Claude Code develop his own MCP tools :) ☆123 · Updated 7 months ago
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning? ☆77 · Updated 7 months ago
- Prompt-to-Leaderboard ☆259 · Updated 5 months ago
- Training-Ready RL Environments + Evals ☆128 · Updated this week
- CursorCore: Assist Programming through Aligning Anything ☆131 · Updated 8 months ago
- ☆232 · Updated 3 months ago
- 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/ ☆376 · Updated 3 months ago
- A prompting library ☆182 · Updated 3 months ago
- Run AI generated code in isolated sandboxes ☆112 · Updated 8 months ago
- The Open Deep Research app – generate reports with OSS LLMs ☆302 · Updated 3 months ago
- ☆68 · Updated 4 months ago
- An agent benchmark with tasks in a simulated software company. ☆564 · Updated last week