ToolBench, an evaluation suite for LLM tool manipulation capabilities.
☆178 · Feb 28, 2024 · Updated 2 years ago
Alternatives and similar repositories for toolbench
Users that are interested in toolbench are comparing it to the libraries listed below.
- [ICLR'24] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use ☆115 · Mar 21, 2024 · Updated 2 years ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆73 · May 13, 2025 · Updated 11 months ago
- ☆920 · Jul 24, 2024 · Updated last year
- [ICLR'24 spotlight] An open platform for training, serving, and evaluating large language model for tool learning. ☆5,621 · May 21, 2025 · Updated 11 months ago
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆230 · Apr 15, 2025 · Updated last year
- [NAACL'25] "Revealing the Barriers of Language Agents in Planning" ☆13 · Jun 22, 2025 · Updated 10 months ago
- This is the repository for paper "CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models" ☆30 · Oct 8, 2023 · Updated 2 years ago
- ☆31 · Jun 12, 2024 · Updated last year
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆271 · Apr 18, 2024 · Updated 2 years ago
- [COLING 2025] NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models ☆18 · Jan 18, 2025 · Updated last year
- [ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step ☆305 · Apr 3, 2024 · Updated 2 years ago
- ☆26 · Nov 19, 2025 · Updated 5 months ago
- LLM evaluation. ☆16 · Nov 7, 2023 · Updated 2 years ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,377 · Feb 8, 2026 · Updated 2 months ago
- Source code for paper: Knowledge Inheritance for Pre-trained Language Models ☆37 · Apr 24, 2022 · Updated 4 years ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆43 · Feb 15, 2024 · Updated 2 years ago
- [ACL2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ☆71 · Aug 5, 2025 · Updated 9 months ago
- Paper collection on building and evaluating language model agents via executable language grounding ☆365 · Apr 29, 2024 · Updated 2 years ago
- [ACL 2024] On the Multi-turn Instruction Following for Conversational Web Agents ☆17 · Oct 12, 2024 · Updated last year
- 📊 A simple command-line utility for querying and monitoring GPU status ☆14 · Aug 3, 2023 · Updated 2 years ago
- Source code for ACL 2021 paper "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learni… ☆85 · May 26, 2021 · Updated 4 years ago
- Official repository of Graph RAG-Tool Fusion and ToolLinkOS dataset. ☆23 · Feb 13, 2025 · Updated last year
- This is the repository for the Tool Learning survey. ☆483 · Aug 9, 2025 · Updated 8 months ago
- ☆102 · Dec 7, 2023 · Updated 2 years ago
- An end-to-end benchmark suite of multi-modal DNN applications for system-architecture co-design ☆22 · Dec 13, 2024 · Updated last year
- ☆12 · Jan 2, 2024 · Updated 2 years ago
- ☆84 · Apr 18, 2024 · Updated 2 years ago
- This is the official implementation for MA-LoT. ☆19 · Aug 4, 2025 · Updated 9 months ago
- [NAACL'25 🏆 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert… ☆16 · Feb 4, 2025 · Updated last year
- The official code for "ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases" ☆887 · Oct 26, 2024 · Updated last year
- [ICML'24 Spotlight] "TravelPlanner: A Benchmark for Real-World Planning with Language Agents" ☆510 · Nov 7, 2025 · Updated 5 months ago
- The code and data for the paper JiuZhang3.0 ☆49 · May 26, 2024 · Updated last year
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels … ☆285 · Aug 19, 2023 · Updated 2 years ago
- reStructured Pre-training ☆99 · Dec 22, 2022 · Updated 3 years ago
- NexusRaven-13B, a new SOTA Open-Source LLM for function calling. This repo contains everything for reproducing our evaluation on NexusRav… ☆320 · Sep 29, 2023 · Updated 2 years ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆64 · Mar 26, 2024 · Updated 2 years ago
- ☆25 · Jun 25, 2019 · Updated 6 years ago
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" ☆24 · Mar 18, 2025 · Updated last year
- Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls) ☆12,858 · Apr 13, 2026 · Updated 3 weeks ago