benchflow-ai / benchflow
AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks.
☆141 Updated last week
Alternatives and similar repositories for benchflow:
Users who are interested in benchflow are comparing it to the libraries listed below.
- Solving data for LLMs - Create quality synthetic datasets! ☆146 Updated 3 months ago
- CodeScientist: An automated scientific discovery system for code-based experiments ☆237 Updated last month
- Commit0: Library Generation from Scratch ☆144 Updated 3 weeks ago
- Kura is a simple reproduction of the CLIO paper which uses language models to label user behaviour before clustering them based on embedd… ☆102 Updated 3 weeks ago
- ⚖️ Awesome LLM Judges ⚖️ ☆94 Updated last week
- ☆147 Updated 2 months ago
- A prompting library ☆163 Updated 7 months ago
- 🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/ ☆337 Updated this week
- II-Researcher: a new open-source framework designed to aid building search / research agents ☆246 Updated last week
- AWM: Agent Workflow Memory ☆268 Updated 3 months ago
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆512 Updated last month
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning? ☆65 Updated last month
- ☆164 Updated last week
- Letting Claude Code develop his own MCP tools :) ☆98 Updated last month
- Claude Deep Research config for Claude Code. ☆170 Updated last month
- Verdict is a library for scaling judge-time compute. ☆202 Updated this week
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym ☆448 Updated last month
- ☆79 Updated 2 weeks ago
- List of Open Source projects built on Browser Use ☆57 Updated this week
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆170 Updated this week
- Hallucination Detector is a free and open-source tool that helps you verify the accuracy of your LLM generated content instantly. ☆204 Updated 3 months ago
- ☆85 Updated 7 months ago
- Prompt design in Python ☆57 Updated 5 months ago
- 🍎APPL: A Prompt Programming Language. Seamlessly integrate LLMs with programs. ☆247 Updated 2 months ago
- Agent computer interface for AI software engineer. ☆68 Updated this week
- Prompt engineering, automated. ☆304 Updated 2 weeks ago
- An agent benchmark with tasks in a simulated software company. ☆320 Updated 3 weeks ago
- Atom of Thoughts for Markov LLM Test-Time Scaling ☆560 Updated last week
- Beating the GAIA benchmark with Transformers Agents. 🚀 ☆113 Updated 2 months ago
- Atropos is a Language Model Reinforcement Learning Environments framework for collecting and evaluating LLM trajectories through diverse … ☆171 Updated this week