benchflow-ai/skillsbench

SkillsBench evaluates how well skills work and how effective agents are at using them.

☆1,502

Alternatives and similar repositories for skillsbench

Users who are interested in skillsbench are comparing it to the repositories listed below.

harbor-framework / harbor
Framework for evaluating and improving agents
☆3,195 · Updated this week
claw-eval / claw-eval
Claw-Eval is an evaluation harness for evaluating LLM as agents. All tasks verified by humans.
☆717 · May 17, 2026 · Updated 2 months ago
benchflow-ai / ClawsBench
Repository for results and data (coming soon!) for ClawsBench
☆30 · Apr 8, 2026 · Updated 3 months ago
hkust-nlp / Toolathlon
[ICLR 2026] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
☆428 · Updated this week
GeniusHTX / SWE-Skills-Bench
The official repo of our paper, "SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?"
☆56 · Jun 17, 2026 · Updated 3 weeks ago
ynulihao / AgentSkillOS
Build your agent from 200,000+ skills via skill RETRIEVAL & ORCHESTRATION
☆550 · Mar 7, 2026 · Updated 4 months ago
zjunlp / SkillNet
Create, Evaluate, and Connect AI Skills
☆1,100 · Updated this week
InternLM / WildClawBench
An in-the-wild benchmark for AI agents in the OpenClaw Environment.
☆476 · Updated this week
Gen-Verse / OpenClaw-RL
OpenClaw-RL: Train any agent simply by talking
☆5,573 · May 23, 2026 · Updated last month
Zhang-Henry / CoEvoSkills
CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
☆51 · Apr 11, 2026 · Updated 3 months ago
harbor-framework / terminal-bench-3
Measuring agents' ability to get work done on a computer
☆311 · Updated this week
sierra-research / tau2-bench
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
☆1,585 · Updated this week
aisa-group / PostTrainBench
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
☆453 · Updated this week
zhengyanzhao1997 / SkillRouter
SkillRouter: Retrieve-and-Rerank Skill Selection for LLM Agents at Scale
☆208 · Jun 25, 2026 · Updated 3 weeks ago
sentient-agi / EvoSkill
EvoSkill — An open-source framework that automatically discovers and synthesizes reusable agent skills from failed trajectories to improv…
☆1,038 · Jul 6, 2026 · Updated last week
Snowflake-Labs / agent-world-model
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
☆409 · May 28, 2026 · Updated last month
pinchbench / skill
PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https://kilo.ai
☆1,278 · Jul 2, 2026 · Updated 2 weeks ago
UCSB-NLP-Chang / Skill-Usage
☆45 · Apr 8, 2026 · Updated 3 months ago
davidliuk / graph-of-skills
Dependency-Aware Structural Retrieval for Massive Agent Skills
☆186 · May 4, 2026 · Updated 2 months ago
Tencent-Hunyuan / CL-bench
CL-bench: A Benchmark for Context Learning
☆566 · May 12, 2026 · Updated 2 months ago
NVIDIA-NeMo / ProRL-Agent-Server
Agentic RL on Any Harness at Scale
☆668 · Updated this week
verl-project / verl
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
☆22,482 · Updated this week
aiming-lab / SkillRL
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
☆887 · May 17, 2026 · Updated last month
stanford-iris-lab / meta-harness-tbench2-artifact
Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)
☆1,144 · Mar 26, 2026 · Updated 3 months ago
cxcscmu / SkillLearnBench
[COLM'26] SkillLearnBench is the first benchmark for evaluating continual learning methods that automatically generate agent skills.
☆70 · Jul 9, 2026 · Updated last week
Gen-Verse / Open-AgentRL
RLAnything (ICML 2026) & AutoTool (ICML 2026), DemyAgent: Open-Source RL for LLMs and Agentic Scenarios
☆581 · Jun 12, 2026 · Updated last month
metaevo-ai / meta-context-engineering
[ICML 2026] Meta Context Engineering via Agentic Skill Evolution
☆139 · May 4, 2026 · Updated 2 months ago
RUC-NLPIR / EnvScaler
The official implementation of "EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis".
☆175 · Feb 12, 2026 · Updated 5 months ago
harbor-framework / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆2,455 · Updated this week
ViktorAxelsen / MemSkill
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
☆544 · May 23, 2026 · Updated last month
ace-agent / ace
Evolve your language agent with Agentic Context Engineering (ACE)
☆1,211 · May 19, 2026 · Updated last month
ZJU-REAL / SkillZero
Official code for "SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization"
☆347 · May 20, 2026 · Updated last month
ZhangZi-a / SkillFlow
☆40 · May 12, 2026 · Updated 2 months ago
open-thoughts / OpenThoughts-Agent
Data recipes and robust infrastructure for training AI agents
☆259 · Updated this week
ThakiCloud / SKILLRET
Skill retrieval benchmark dataset and evaluation code.
☆20 · May 8, 2026 · Updated 2 months ago
PeterGriffinJin / Search-R1
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL
☆5,107 · Nov 13, 2025 · Updated 8 months ago
thunlp / OPD
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
☆808 · Jun 29, 2026 · Updated 2 weeks ago
harbor-framework / terminal-bench-2
☆333 · Apr 30, 2026 · Updated 2 months ago
harbor-framework / terminal-bench-science
Terminal-Bench Science: Evaluating AI Agents on Complex Real-World Scientific Workflows in the Terminal
☆185 · Updated this week