☆206 Dec 13, 2025 Updated 3 months ago
Alternatives and similar repositories for cybench
Users that are interested in cybench are comparing it to the libraries listed below
- The goal of this repo is to become a benchmark for pentesting ☆22 Oct 25, 2024 Updated last year
- The repository of Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning. ☆30 Sep 8, 2025 Updated 6 months ago
- A benchmark for Java gadget chain detecting algorithms. ☆15 Jun 20, 2025 Updated 9 months ago
- The D-CIPHER and NYU CTF baseline LLM Agents built for NYU CTF Bench ☆137 Oct 25, 2025 Updated 4 months ago
- Constructing a community of LLM-based agents in Minecraft ☆17 Nov 27, 2025 Updated 3 months ago
- ☆247 Updated this week
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆77 Mar 1, 2025 Updated last year
- This repo contains the codes of the penetration test benchmark for Generative Agents presented in the paper "AutoPenBench: Benchmarking G… ☆71 Oct 28, 2025 Updated 4 months ago
- ☆126 Sep 22, 2025 Updated 6 months ago
- XBOW Validation Benchmarks ☆523 Jun 18, 2025 Updated 9 months ago
- AIxCC: automated vulnerability repair via LLMs, search, and static analysis ☆11 Jul 16, 2024 Updated last year
- CyberBench: A Multi-Task Cyber LLM Benchmark ☆30 Apr 29, 2025 Updated 10 months ago
- Reading and sharing of frontier work related to software engineering and formal methods ☆36 Oct 27, 2025 Updated 4 months ago
- mcp wrapper for openai built-in tools ☆12 Mar 13, 2025 Updated last year
- LLM agent solving traces, leaderboards, and benchmark results across security CTF and hacking platforms ☆54 Updated this week
- ☆66 Sep 13, 2025 Updated 6 months ago
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va… ☆12 Nov 6, 2023 Updated 2 years ago
- ☆80 Feb 11, 2026 Updated last month
- ☆43 Jan 30, 2023 Updated 3 years ago
- Security Vulnerability Repair via Concolic Execution and Code Mutations ☆19 Sep 12, 2024 Updated last year
- FUGIO: Automatic Exploit Generation for PHP Object Injection Vulnerabilities ☆98 Nov 27, 2023 Updated 2 years ago
- [NAACL'25] "Revealing the Barriers of Language Agents in Planning" ☆13 Jun 22, 2025 Updated 9 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆162 May 29, 2025 Updated 9 months ago
- A dedicated agent for CTF competitions, built as a secondary development on kimi-cli ☆46 Dec 3, 2025 Updated 3 months ago
- CS-Eval is a comprehensive evaluation suite for fundamental cybersecurity models or large language models' cybersecurity ability. ☆60 Nov 27, 2024 Updated last year
- CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities ☆178 Jan 14, 2026 Updated 2 months ago
- A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. ☆488 Mar 12, 2026 Updated last week
- ☆13 Jan 30, 2025 Updated last year
- An online AI security course created by UChicago's XLab ☆31 Feb 21, 2026 Updated last month
- Code implementation for paper AbsenceBench: Language Models Can't Tell What's Missing ☆17 Oct 23, 2025 Updated 4 months ago
- AlgZoo: uninterpreted models with fewer than 1,500 parameters ☆45 Jan 19, 2026 Updated 2 months ago
- Industrial Cybersecurity Conference Index ☆13 Mar 11, 2024 Updated 2 years ago
- Collection of evals for Inspect AI ☆406 Updated this week
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 Jul 17, 2024 Updated last year
- ☆11 Oct 13, 2020 Updated 5 years ago
- ☆27 Oct 6, 2024 Updated last year
- A polyglot static analysis engine for detecting vulnerabilities in scripting languages native extensions based on joern. ☆21 Sep 1, 2025 Updated 6 months ago
- ☆24 Jan 27, 2026 Updated last month
- CyberGym is a large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on… ☆172 Feb 23, 2026 Updated last month