Ranking LLMs on agentic tasks
☆222May 21, 2026Updated 3 weeks ago
Alternatives and similar repositories for agent-leaderboard
Users that are interested in agent-leaderboard are comparing it to the libraries listed below.
- NeurIPS 2024: SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation☆13May 24, 2025Updated last year
- ☆22Nov 4, 2024Updated last year
- A Model Context Protocol server for Postgres☆25Jan 5, 2026Updated 5 months ago
- ☆63Jun 2, 2026Updated last week
- Examples of using Galileo for better ML data quality!!☆13Feb 5, 2026Updated 4 months ago
- 📝The official repository of "Rethinking Cross-Generator Image Forgery Detection through DINOv3"☆25Dec 2, 2025Updated 6 months ago
- Baker is an AI powered app that helps you find recipes and avoid food waste☆14Jan 4, 2025Updated last year
- A tool to assist in the interpretation of learned features in sparse autoencoders (in particular the four SAE's trained by Joseph Bloom o…☆19Oct 4, 2024Updated last year
- ☆23Oct 28, 2024Updated last year
- MLGym: A New Framework and Benchmark for Advancing AI Research Agents☆603Aug 10, 2025Updated 10 months ago
- AgentQL's integrations with workflow automation tools and AI agent frameworks let you extract structured data from web pages using querie…☆27Updated this week
- Security-native LLM system for AI-generated application security.☆263Jun 4, 2026Updated last week
- Official code repo for paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs"☆24Apr 30, 2025Updated last year
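- Welcome to the "Get Started with Strands in 5 Minutes" tutorial series! This series focuses on improving the ability of users and developers to build AI agents. Through concise 5-minute tutorials, it helps you quickly master the design, development, integration, and deployment of Strands Agent.☆22Dec 8, 2025Updated 6 months ago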
- Python Server for C3 AI app. A project that brings the power of Large Language Models (LLM) and Retrieval-Augmented Generation (RAG) with…☆24Jan 7, 2024Updated 2 years ago
- Put your data somewhere you can look at it☆31Jun 9, 2025Updated last year
- "Syntriever: How to Train Your Retriever with Synthetic Data from LLMs" the Nations of the Americas Chapter of the Association for Comput…☆29Mar 5, 2025Updated last year
- Microsoft Graph CLI - Mail, Calendar, OneDrive, To-Do, Contacts☆64Mar 6, 2026Updated 3 months ago
- [KDD24-ADS] R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models☆11Apr 9, 2024Updated 2 years ago
- MCP orchestrator that converts MCP servers to agents.☆25Jun 9, 2026Updated last week
- Associated code for the Quickstart tutorial☆17Aug 18, 2023Updated 2 years ago
- Course Material☆20Aug 11, 2025Updated 10 months ago
- Large Language Models for the Terminal☆17Dec 11, 2023Updated 2 years ago
- meta_llama_2finetuned_text_generation_summarization☆21Jul 21, 2023Updated 2 years ago
- This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software E…☆1,439Jul 18, 2025Updated 10 months ago
- A Structured Output Benchmark whose 'ground-truth' is actually right☆19Dec 5, 2025Updated 6 months ago
- AgentFence is an open-source platform for automatically testing AI agent security. It identifies vulnerabilities such as prompt injection…☆55Mar 6, 2025Updated last year
- An agentic AI application that lets you chat with your papers and also gather information from papers on arXiv and PubMed☆154May 18, 2025Updated last year
- ☆19Oct 23, 2024Updated last year
- Create Vector Store from Scratch in pure Python.☆13Dec 15, 2023Updated 2 years ago
- [NAACL'25] "Revealing the Barriers of Language Agents in Planning"☆13Jun 22, 2025Updated 11 months ago
- ☆44Dec 14, 2024Updated last year
- Stanford CS224W: Machine Learning with Graphs (GNN)☆12Sep 6, 2022Updated 3 years ago
- Distributed IO-aware Attention algorithm☆24Sep 24, 2025Updated 8 months ago
- Code and data for the paper: IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Large Language Models …☆12Apr 27, 2024Updated 2 years ago
- LLM Building Blocks for Python Course☆17Nov 17, 2025Updated 6 months ago
- "DeepResearch-Eval: An End-to-End Evaluation Framework for DeepResearch Systems"☆47Oct 16, 2025Updated 8 months ago
- T2Ranking: A large-scale Chinese benchmark for passage ranking.☆163Jul 3, 2023Updated 2 years ago
- Lightweight hallucination detection framework for RAG applications☆577Updated this week