philschmid / ai-agent-benchmark-compendium
Compendium of over 50 benchmarks for evaluating AI agents, categorized into Function Calling & Tool Use, General Assistant & Reasoning, Coding & Software Engineering, and Computer Interaction.
☆58 Updated last month
Alternatives and similar repositories for ai-agent-benchmark-compendium
Users who are interested in ai-agent-benchmark-compendium are comparing it to the libraries listed below.
- Leveraging DSPy for AI-driven task understanding and solution generation, the Self-Discover Framework automates problem-solving through r… ☆72 Updated 3 weeks ago
- ☆36 Updated 6 months ago
- A collection of Compound Retrieval Systems implemented with DSPy and Weaviate. ☆91 Updated last month
- Official code for NeurIPS 2025 paper "AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise" ☆95 Updated this week
- ☆36 Updated 9 months ago
- ☆83 Updated 2 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆93 Updated last month
- Prompt design in Python ☆63 Updated last year
- Comparing retrieval abilities from GPT4-Turbo and a RAG system on a toy example for various context lengths ☆35 Updated last year
- ☆90 Updated 10 months ago
- Very minimal (and stateless) agent framework ☆45 Updated 10 months ago
- ScreenSuite - The most comprehensive benchmarking suite for GUI Agents! ☆132 Updated 2 months ago
- ☆24 Updated 10 months ago
- A seamless matchmaking application that is programmed with Cohere Command R+, Stanford NLP DSPy framework, Weaviate Vector store and Crew… ☆59 Updated last year
- Minimal agent runtime built with DSPy modules and a thin Python loop. Includes CLI, FastAPI server, and eval harness with OpenAI/Ollama s… ☆65 Updated 2 months ago
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆106 Updated 2 months ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆88 Updated this week
- ☆57 Updated 2 years ago
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆113 Updated 7 months ago
- ☆88 Updated 3 weeks ago
- A user interface for DSPy ☆198 Updated last month
- ☆80 Updated last year
- An MCP server that uses the Osmosis-Apply-1.7B model to apply code merges ☆53 Updated 4 months ago
- Python library to use Pleias-RAG models ☆67 Updated 6 months ago
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles ☆60 Updated 6 months ago
- Simple Graph Memory for AI applications ☆89 Updated 6 months ago
- A version of BabyAGI using DSPy and typed predictors ☆17 Updated last year
- ☆73 Updated 10 months ago
- ☆50 Updated 3 months ago
- A framework for evaluating function calls made by LLMs ☆40 Updated last year