philschmid / ai-agent-benchmark-compendium
Compendium of over 50 benchmarks for evaluating AI agents, categorized into Function Calling & Tool Use, General Assistant & Reasoning, Coding & Software Engineering, and Computer Interaction.
☆84 Updated 3 months ago
Alternatives and similar repositories for ai-agent-benchmark-compendium
Users who are interested in ai-agent-benchmark-compendium are comparing it to the repositories listed below.
- ☆37 Updated 8 months ago
- ☆39 Updated last year
- Leveraging DSPy for AI-driven task understanding and solution generation, the Self-Discover Framework automates problem-solving through r… ☆72 Updated 2 months ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆107 Updated 4 months ago
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆114 Updated 9 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆99 Updated 3 months ago
- Deep research agents using MiniMax M2.1 interleaved thinking ☆194 Updated last month
- Context Engineering Course with DSPy ☆211 Updated 6 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆34 Updated 9 months ago
- An MCP server that uses the Osmosis-Apply-1.7B model to apply code merges ☆53 Updated 6 months ago
- ScreenSuite - The most comprehensive benchmarking suite for GUI Agents! ☆135 Updated 4 months ago
- Testing paligemma2 finetuning on reasoning dataset ☆18 Updated last year
- Very minimal (and stateless) agent framework ☆44 Updated last year
- ☆95 Updated last week
- Prompt design in Python ☆65 Updated last year
- A seamless matchmaking application that is programmed with Cohere Command R+, Stanford NLP DSPy framework, Weaviate Vector store and Crew… ☆59 Updated last year
- ☆85 Updated 4 months ago
- A user interface for DSPy ☆210 Updated 3 months ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆90 Updated last month
- ☆269 Updated last week
- ☆94 Updated last year
- ☆80 Updated last year
- Official code for NeurIPS 2025 paper "AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise" ☆125 Updated last week
- Official Repo for The Paper "Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems" ☆60 Updated 11 months ago
- ☆36 Updated 11 months ago
- ☆78 Updated last month
- alphaxiv open source alternative ☆107 Updated 8 months ago
- A collection of Compound Retrieval Systems implemented with DSPy and Weaviate. ☆94 Updated 3 weeks ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆186 Updated last week
- ☆57 Updated 2 years ago