CentreSecuriteIA / BELLS
Benchmarks for the Evaluation of LLM Supervision
☆32 · Updated 2 months ago
Alternatives and similar repositories for BELLS
Users interested in BELLS are comparing it to the repositories listed below.
- ControlArena is a suite of realistic settings, mimicking complex deployment environments, for running control evaluations. This is an alp… ☆61 · Updated this week
- Collection of evals for Inspect AI ☆144 · Updated this week
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆106 · Updated last year
- 📚 A curated list of papers & technical articles on AI Quality & Safety ☆182 · Updated last month
- METR Task Standard ☆148 · Updated 4 months ago
- Mechanistic Interpretability Visualizations using React ☆253 · Updated 5 months ago
- This repository collects all relevant resources about interpretability in LLMs ☆353 · Updated 7 months ago
- An open-source compliance-centered evaluation framework for Generative AI models ☆152 · Updated last month
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL. ☆214 · Updated last year
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track] ☆347 · Updated 2 months ago
- 🧠 Starter templates for doing interpretability research ☆69 · Updated last year
- Sparse Autoencoder for Mechanistic Interpretability ☆248 · Updated 10 months ago
- Machine Learning for Alignment Bootcamp ☆25 · Updated last year
- Inference API for many LLMs and other useful tools for empirical research ☆48 · Updated last week
- Python package for measuring memorization in LLMs. ☆156 · Updated 6 months ago
- A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. ☆175 · Updated this week
- Improving Alignment and Robustness with Circuit Breakers ☆209 · Updated 8 months ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆294 · Updated 8 months ago
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆313 · Updated 4 months ago
- A repository of Language Model Vulnerabilities and Exposures (LVEs). ☆110 · Updated last year
- This is an open-source tool to assess and improve the trustworthiness of AI systems. ☆92 · Updated 3 weeks ago
- Fairness toolkit for PyTorch, scikit-learn, and AutoGluon ☆32 · Updated 5 months ago