philschmid/ai-agent-benchmark-compendium

philschmid / ai-agent-benchmark-compendium

Compendium of over 50 benchmarks for evaluating AI agents, categorized into Function Calling & Tool Use, General Assistant & Reasoning, Coding & Software Engineering, and Computer Interaction.

☆166

Alternatives and similar repositories for ai-agent-benchmark-compendium

Users who are interested in ai-agent-benchmark-compendium are comparing it to the repositories listed below.

philschmid / self-learning-skill
☆82 · Feb 6, 2026 · Updated 5 months ago
firstbatchxyz / dria-agent
powerful and fast tool calling agents
☆78 · Mar 19, 2025 · Updated last year
gemini-cli-extensions / cloud-sql-postgresql
Skills for Cloud SQL for PostgreSQL
☆40 · Updated this week
alpha912 / codebase-md
CodebaseMD: A VS Code extension that converts codebases into structured Markdown documentation, optimized for LLMs and agentic coding too…
☆15 · May 22, 2025 · Updated last year
AdoHaha / dspy_fun
An introduction to DSPy
☆33 · Aug 30, 2025 · Updated 10 months ago
run-llama / llama-ui
☆28 · Updated this week
DandyQi / MaskedCRF
☆44 · Sep 26, 2021 · Updated 4 years ago
KingOfTheAce2 / codex-flow
☆25 · Sep 10, 2025 · Updated 9 months ago
agora-protocol / python
Python interface for Agora
☆68 · Mar 8, 2025 · Updated last year
lightblue-tech / lb-reranker
☆24 · Jan 30, 2025 · Updated last year
briancavalier / most-behavior
You're probably looking for https://github.com/briancavalier/most-behave instead
☆11 · Jul 19, 2018 · Updated 7 years ago
krypticmouse / dspy-docs
Official Documentation for DSPy Library
☆24 · Updated this week
seohyunwoo-0407 / GAR
FinanceRAG project by KAIST students. Advanced Retrieval-Augmented Generation (RAG) system designed for the financial domain.
☆16 · Feb 11, 2025 · Updated last year
pChitral / ETL-SEC-EDGAR-10-k-Filings
ETL-10-K-Filings is a Python-based open-source project designed for ETL of financial data from SEC Edgar filings. Focusing on the MDA Sec…
☆17 · Feb 11, 2024 · Updated 2 years ago
tumarkin / edgar
A command line utility to locally index and download filings from the SEC Edgar database.
☆13 · Feb 18, 2025 · Updated last year
Genekkion / theHermit
A quick fix model for the Charm BubbleTea ecosystem.
☆16 · Nov 27, 2025 · Updated 7 months ago
john-friedman / txt2dataset
Convert unstructured text into structured datasets
☆27 · Apr 15, 2026 · Updated 2 months ago
cassimons / healthfutures-evagg
☆29 · Aug 25, 2025 · Updated 10 months ago
willer / claude-fsd
Run Claude Code (and codex) to generate a project plan, then run them in a loop for days until they're done
☆14 · Jan 18, 2026 · Updated 5 months ago
rasyosef / splade-index
Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba
☆38 · Oct 16, 2025 · Updated 8 months ago
Blkalkin / Optimal-TestTime
☆10 · Mar 24, 2025 · Updated last year
fastxyz / skill-optimizer
Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs
☆69 · May 28, 2026 · Updated last month
TuanaCelik / text-to-sql-snowflake-llamaindex
Advanced Text2SQL with LlamaIndex and Snowflake models
☆44 · Oct 9, 2025 · Updated 8 months ago
gfodor / tiny-simple-peer
📡 Simple peer that does not rely upon node polyfills
☆20 · Oct 26, 2023 · Updated 2 years ago
couger-inc / cream
zkCREAM is zk-SNARK based anonymized voting application using a token mixer
☆39 · Feb 18, 2022 · Updated 4 years ago
Animadversio / GPT-Auto-Data-Analytics
Automatize local data analysis with team of tool-using GPT agents
☆17 · Apr 1, 2024 · Updated 2 years ago
bergant / xbrlus
R interface to XBRL US API
☆22 · Feb 22, 2018 · Updated 8 years ago
hamelsmu / hamel
General Utilities
☆57 · Jun 21, 2026 · Updated 2 weeks ago
valstro / markdown-rules-mcp
☆26 · Jun 12, 2025 · Updated last year
ornicar / chess.js
chess.js trimmed down with chess960 support, for lichess.org
☆10 · Aug 6, 2016 · Updated 9 years ago
stutrek / cross-tab-middleware
Redux middleware for sending actions across open browser tabs
☆14 · May 23, 2017 · Updated 9 years ago
joesimmons / YouTube---Auto-Buffer---Auto-HD
Buffers the video without autoplaying and puts it in HD if the option is on. For Firefox, Opera, & Chrome
☆15 · Jul 17, 2015 · Updated 10 years ago
kyleliang919 / Super_Muon
☆69 · Mar 21, 2025 · Updated last year
kargarisaac / dspy_gepa_optimization
☆28 · Sep 8, 2025 · Updated 9 months ago
LinkedInLearning / advanced-rag-applications-with-vector-databases-3886256
This repo is for LinkedIn Learning course: Advanced RAG Applications with Vector Databases
☆32 · Oct 17, 2024 · Updated last year
JigsawStack / jigsawstack-python
Jigsawstack Python SDK
☆20 · Jun 3, 2026 · Updated last month
Marker-Inc-Korea / CoT-llama2
Fine-tuning llama2 using the chain-of-thought approach
☆10 · Nov 18, 2023 · Updated 2 years ago
moonshinelabs-ai / skipper-tool
Let Claude Code and Codex control your browser
☆30 · Aug 30, 2025 · Updated 10 months ago
scalefocus / virusafe-backend
The repo for the ViruSafe Backend project.
☆11 · Jan 21, 2022 · Updated 4 years ago