benchflow-ai / benchflow
AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks.
☆164 Updated 5 months ago
Alternatives and similar repositories for benchflow
Users who are interested in benchflow are comparing it to the libraries listed below.
- OSS RL environment + evals toolkit ☆198 Updated this week
- Verify Precision of all Kimi K2 API Vendor ☆340 Updated last week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆352 Updated this week
- ⚖️ Awesome LLM Judges ⚖️ ☆132 Updated 6 months ago
- Prompt-to-Leaderboard ☆260 Updated 6 months ago
- Agent computer interface for AI software engineer. ☆110 Updated last month
- Challenges for general-purpose web-browsing AI agents ☆66 Updated 5 months ago
- ☆170 Updated 8 months ago
- proof-of-concept of Cursor's Instant Apply feature ☆84 Updated last year
- Letting Claude Code develop his own MCP tools :) ☆123 Updated 8 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆442 Updated this week
- Together Open Deep Research ☆353 Updated 6 months ago
- An Agentic Deep Research Assistant similar to Gemini and OpenAI Deep Research ☆125 Updated 8 months ago
- Commit0: Library Generation from Scratch ☆171 Updated 6 months ago
- Super basic implementation (gist-like) of RLMs with REPL environments. ☆242 Updated 3 weeks ago
- Routing on Random Forest (RoRF) ☆218 Updated last year
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆138 Updated 8 months ago
- ☆231 Updated 4 months ago
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆299 Updated this week
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning? ☆80 Updated 7 months ago
- AWM: Agent Workflow Memory ☆343 Updated 9 months ago
- τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment ☆394 Updated this week
- Real-Time Detection of Hallucinated Entities in Long-Form Generation ☆264 Updated 3 weeks ago
- Claude Deep Research config for Claude Code. ☆223 Updated 7 months ago
- A system that tries to resolve all issues on a github repo with OpenHands. ☆114 Updated 11 months ago
- Prompt design in Python ☆63 Updated 11 months ago
- craft post-training data recipes ☆26 Updated last week
- Solving data for LLMs - Create quality synthetic datasets! ☆150 Updated 9 months ago
- [NAACL2025] LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications ☆127 Updated 3 months ago
- The Open Deep Research app – generate reports with OSS LLMs ☆303 Updated this week