benchflow-ai / benchflow
AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks.
☆177 · Updated last month
Alternatives and similar repositories for benchflow
Users who are interested in benchflow are comparing it to the libraries listed below.
- Harbor is a framework for running agent evaluations and creating and using RL environments. ☆488 · Updated this week
- The LLM abstraction layer for modern AI agent applications. ☆499 · Updated this week
- Challenges for general-purpose web-browsing AI agents ☆67 · Updated 7 months ago
- Curated collection of community environments ☆208 · Updated this week
- ⚖️ Awesome LLM Judges ⚖️ ☆148 · Updated 9 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆415 · Updated last week
- ☆177 · Updated 10 months ago
- A clean, modular SDK for building AI agents with OpenHands V1. ☆459 · Updated this week
- AWM: Agent Workflow Memory ☆387 · Updated last month
- Prompt-to-Leaderboard ☆271 · Updated 8 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆532 · Updated this week
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆302 · Updated last month
- Real-Time Detection of Hallucinated Entities in Long-Form Generation ☆277 · Updated 2 months ago
- Agent computer interface for AI software engineer. ☆115 · Updated last month
- Commit0: Library Generation from Scratch ☆176 · Updated 8 months ago
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat… ☆418 · Updated last week
- LLMProc: Unix-inspired runtime that treats LLMs as processes. ☆34 · Updated 6 months ago
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ☆251 · Updated 3 weeks ago
- ☆68 · Updated 8 months ago
- Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower exe… ☆267 · Updated 8 months ago
- Training setup for Langchain's Open Deep Research ☆74 · Updated 5 months ago
- OpenTinker is an RL-as-a-Service infrastructure for foundation models ☆618 · Updated this week
- An agent benchmark with tasks in a simulated software company. ☆631 · Updated 2 months ago
- ☆59 · Updated last year
- τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment ☆690 · Updated this week
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆145 · Updated 11 months ago
- ☆136 · Updated 10 months ago
- ☆237 · Updated 2 months ago
- Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera… ☆260 · Updated last week
- Verify Precision of all Kimi K2 API Vendor ☆501 · Updated this week