SeraphimSerapis/tool-eval-bench

Tool-calling quality benchmark for LLM serving stacks. 80+ deterministic scenarios testing multi-turn orchestration, safety boundaries, and structured output. Supports vLLM, SGLang, and llama.cpp.

☆247

Alternatives and similar repositories for tool-eval-bench

Users that are interested in tool-eval-bench are comparing it to the libraries listed below.

eugr / llama-benchy
llama-benchy - llama-bench style benchmarking tool for all backends
☆589 · Jul 10, 2026 · Updated 2 weeks ago
whpthomas / spark-auto-round
☆17 · Jun 27, 2026 · Updated 3 weeks ago
RobTand / prismaquant
Mixed-precision quantization for LLMs. Every layer refracts into a different format based on its sensitivity. Native compressed-tensors e…
☆98 · Updated this week
niklasfrick / spark-dashboard
Real-time hardware and LLM inference monitoring — GPU, CPU, memory, and vLLM metrics streamed to a dashboard.
☆84 · Updated this week
eugr / spark-vllm-docker
Docker configuration for running VLLM on dual DGX Sparks
☆1,886 · Updated this week
spark-arena / sparkrun
sparkrun - launch, manage, and stop LLM inference workloads on NVIDIA DGX Spark systems
☆406 · Updated this week
antheas / spark_hwmon
Linux hwmon driver for the NVIDIA DGX Spark (GB10 SoC) that exposes full system power telemetry via standard sensors / sysfs interfaces.
☆25 · Mar 2, 2026 · Updated 4 months ago
DanTup / spark-evals
Some benchmark results of small models and quants that fit on DGX Spark
☆47 · Updated this week
spark-arena / recipe-registry
Official Spark Arena Recipe Registry
☆52 · Jun 13, 2026 · Updated last month
Plaaasma / FlashQLA-Blackwell
FlashQLA TileLang GDN kernels ported to NVIDIA Blackwell consumer (GB10 / DGX Spark)
☆17 · Jun 5, 2026 · Updated last month
tonyd2wild / DeepSeek-v4-Flash-DSpark-1M-NVFP4-KV-2x-DGX-Spark
DeepSeek V4 Flash DSpark 1M NVFP4 KV recipe for 2x DGX Spark
☆155 · Jul 16, 2026 · Updated last week
AEON-7 / vllm-ultimate-dgx-spark
AEON vLLM Ultimate — vLLM 0.25.0 built from source for DGX Spark / Blackwell (sm_121a/GB10). One image serves the whole AEON fleet (Gemma…
☆98 · Jul 17, 2026 · Updated last week
0rand / DeepSeek-v4-DSpark-Aidendle94-GB10-ServingStack
Docker compose serving stack for DeepSeek v4 Flash DSpark for NVIDIA Spark GB10 system using Aidendle94 image
☆19 · Jul 8, 2026 · Updated 2 weeks ago
Entrpi / qwen3.5-122B-A10B-on-spark
Qwen3.5-122B-A10B on a DGX Spark with DFlash speculative decode. One-shot Docker/vLLM installer. 80+ tok/s!
☆45 · Jun 29, 2026 · Updated 3 weeks ago
drowzeys / keys-vLLm-0.24.0-Optimized-DeepSeekV4-Flash-DSpark-NVFP4-KV-1.5M-CTX-3M-Pool-C-12-on-2-DGX-Spark
Run on TWO-DGX-Spark - vLLm-0.24.0 dual cache optimized DSV4F+DSpark+NVFP4 KV (Concurrency 12 with 1.5M context/3M KV token Pool) >0.58-0…
☆20 · Jul 7, 2026 · Updated 2 weeks ago
albond / DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%)
☆283 · Jun 2, 2026 · Updated last month
MiaAI-Lab / DeepSeek-v4-Flash-DSpark-2x-DGX-Spark
☆166 · Updated this week
parallelArchitect / sparkview
Operator-grade GPU monitor for NVIDIA GPUs with native GB10 / DGX Spark coherent UMA support — PSI pressure, clock detection, ConnectX-7 …
☆23 · May 31, 2026 · Updated last month
MiaAI-Lab / DeepSeek-V4-Flash-Dual-DGX-Spark-1M-Context
Deploy DeepSeek V4 Flash (MoE reasoning model) on dual DGX Spark nodes with 1M token context, InfiniBand, and FP8 KV-cache
☆84 · Jul 9, 2026 · Updated 2 weeks ago
ateska / dgx-spark-prometheus
A Prometheus metrics exporter for NVIDIA DGX Spark clusters.
☆18 · Feb 16, 2026 · Updated 5 months ago
mcampa / sparkrun-ui
Web UI for sparkrun — launch and monitor inference workloads on NVIDIA DGX Spark
☆22 · Jun 16, 2026 · Updated last month
Avarok-Cybersecurity / dgx-vllm
A dedicated effort to make an optimized, bleeding edge vLLM image using Docker to support DGX comprehensively
☆124 · Feb 22, 2026 · Updated 5 months ago
AEON-7 / Aeon-Bench-Pod
Run the AEON Bench suite on your own hardware: verified HuggingFace pull → serve → benchmark (text · agentic ×3 harnesses · vision · audi…
☆21 · Updated this week
namake-taro / vllm-custom
☆20 · Apr 7, 2026 · Updated 3 months ago
technigmaai / dgx-spark
☆19 · May 31, 2026 · Updated last month
joeynyc / spark-doctor
Local diagnostic CLI for NVIDIA DGX Spark (GB10). Detects power caps, UMA pressure, thermal risk, CUDA 13/SM_121 wheel mismatches, Docker…
☆92 · Jul 11, 2026 · Updated last week
AEON-7 / Qwen3.6-35B-A3B-heretic-NVFP4-DFlash
Qwen3.6-35B-A3B-heretic NVFP4 + DFlash speculative decoding on DGX Spark (GB10/sm_121a). Source-built vLLM image + 7 patches + comprehens…
☆129 · Jun 28, 2026 · Updated 3 weeks ago
Entrpi / ds4-spark-vllm
antirez/ds4-style hybrid quant DeepSeek V4 Flash on a single DGX Spark via vLLM
☆15 · May 11, 2026 · Updated 2 months ago
notwitcheer / llm-bench-rig
Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publisha…
☆24 · Updated this week
stevibe / local-screen-agent
☆68 · Jun 4, 2026 · Updated last month
calico88x / DGX-Model-Manager
Single-file web UI for NVIDIA DGX Spark — pull Ollama models, browse and download from HuggingFace, manage LiteLLM routing, and control S…
☆28 · May 19, 2026 · Updated 2 months ago
hikarioyama / qwen36-a6b
☆27 · Jul 16, 2026 · Updated last week
Triplany / comfyui-dgx-spark
comfyui optimizations for the dgx spark
☆28 · Apr 30, 2026 · Updated 2 months ago
AEON-7 / vllm-dflash
DFlash vLLM for DGX Spark — Plug & Play Block-Diffusion Speculative Decoding
☆52 · Jun 28, 2026 · Updated 3 weeks ago
Avarok-Cybersecurity / atlas
Pure Rust Inference Engine
☆609 · Updated this week
AEON-7 / Ornith-1.0-35B-AEON-Ultimate-Uncensored
Uncensored/abliterated Ornith-1.0-35B (AEON Ultimate): 0% refusal, 0 coding-capability loss. BF16 + FP8 for vLLM.
☆67 · Jun 29, 2026 · Updated 3 weeks ago
MiaAI-Lab / Unsloth-Qwen3.6-35b-NVFP4-DGX-Spark
vLLM deployment for Unsloth Qwen3.6-35B-A3B-NVFP4-Fast on NVIDIA DGX Spark
☆25 · Jul 11, 2026 · Updated last week
kreuzhofer / dgx-spark-unsloth-qwen3.5-training
bf16 LoRA fine-tuning of [Qwen3.5-35B-A3B](https://huggingface.co/unsloth/Qwen3.5-35B-A3B) (a 35B-total / 3B-active Mixture-of-Experts vi…
☆15 · Mar 12, 2026 · Updated 4 months ago
AEON-7 / Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash
Fully uncensored, capability-enhanced abliteration of Qwen3.6-27B. NVFP4 + z-lab DFlash speculative decoding (n=12) on the unified ghcr.i…
☆424 · Jul 3, 2026 · Updated 3 weeks ago