mlcommons / ailuminate
The AILuminate v1.1 benchmark suite is an AI risk assessment benchmark developed with broad involvement from leading AI companies, academia, and civil society.
☆56 Updated 5 months ago
Alternatives and similar repositories for ailuminate
Users who are interested in ailuminate are comparing it to the libraries listed below.
- Public repository containing METR's DVC pipeline for eval data analysis ☆138 Updated 7 months ago
- A better way of testing, inspecting, and analyzing AI Agent traces. ☆40 Updated last month
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ☆217 Updated last week
- Prompts used in the Automated Auditing Blog Post ☆125 Updated 4 months ago
- A subset of jailbreaks automatically discovered by the Haize Labs haizing suite. ☆99 Updated 7 months ago
- An implementation of DeepMind's Promptbreeder. ☆22 Updated last year
- Pivotal Token Search ☆131 Updated 4 months ago
- The Granite Guardian models are designed to detect risks in prompts and responses. ☆121 Updated last month
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆121 Updated 2 weeks ago
- Let Claude control a web browser on your machine. ☆39 Updated 5 months ago
- Red-Teaming Language Models with DSPy ☆238 Updated 9 months ago
- Explore token trajectory trees on instruct and base models ☆148 Updated 6 months ago
- Thorn in a HaizeStack test for evaluating long-context adversarial robustness. ☆26 Updated last year
- A preprint version of our recent research on the capability of frontier AI systems to do self-replication ☆58 Updated 11 months ago
- A Text-Based Environment for Interactive Debugging ☆277 Updated this week
- A benchmarking tool for evaluating AI coding assistants on real-world software engineering tasks from the SWE-Bench dataset. ☆61 Updated 5 months ago
- Multi-language code navigation API in a container ☆95 Updated 3 months ago
- A library for benchmarking the Long Term Memory and Continual learning capabilities of LLM based agents. With all the tests and code you… ☆80 Updated 11 months ago
- Accompanying code and SEP dataset for the "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" paper. ☆57 Updated 8 months ago
- A cookiecutter template for creating a new LLM plugin that adds tools to LLM ☆27 Updated 6 months ago
- LLM plugin for clustering embeddings ☆82 Updated last year
- 🧬 The Huxley-Gödel Machine ☆301 Updated this week
- Code for "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs" ☆83 Updated 9 months ago
- Using Large Language Models for Repo-wide Type Prediction ☆112 Updated last year
- Your buddy in the (L)LM space. ☆64 Updated last year
- Alice in Wonderland code base for experiments and raw experiments data ☆131 Updated 2 months ago
- Sphynx Hallucination Induction ☆53 Updated 10 months ago
- Model Context Protocol (MCP) server for constraint optimization and solving ☆140 Updated 2 months ago
- This repo tracks the opened and merged PRs by the top SWE coding agents by OpenAI, GitHub, and others. Updates every 3 hours. ☆296 Updated this week
- A suite of open-ended, non-imitative tasks involving generalizable skills for large language model chatbots and agents to enable bootstra… ☆41 Updated 10 months ago