CentreSecuriteIA / BELLS
Benchmarks for the Evaluation of LLM Supervision
☆32Updated last month
Alternatives and similar repositories for BELLS
Users who are interested in BELLS are comparing it to the libraries listed below
- ControlArena is a collection of settings, model organisms and protocols for running control experiments.☆90Updated this week
- Collection of evals for Inspect AI☆215Updated this week
- Inspect: A framework for large language model evaluations☆1,293Updated this week
- METR Task Standard☆159Updated 7 months ago
- Mechanistic Interpretability Visualizations using React☆282Updated 8 months ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models.☆646Updated this week
- Sparse Autoencoder for Mechanistic Interpretability☆260Updated last year
- ☆74Updated 2 years ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆112Updated last year
- ☆81Updated 6 months ago
- ☆45Updated last year
- Stanford NLP Python library for understanding and improving PyTorch models via interventions☆800Updated this week
- ☆16Updated this week
- Red-Teaming Language Models with DSPy☆211Updated 6 months ago
- A toolkit for describing model features and intervening on those features to steer behavior.☆198Updated 9 months ago
- A fast + lightweight implementation of the GCG algorithm in PyTorch☆278Updated 3 months ago
- ☆57Updated last month
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal☆713Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".☆262Updated 2 months ago
- ☆238Updated 11 months ago
- ☆201Updated 5 months ago
- Tools for understanding how transformer predictions are built layer-by-layer☆521Updated 3 weeks ago
- open source interpretability platform 🧠☆375Updated this week
- ☆31Updated this week
- Sparsify transformers with SAEs and transcoders☆613Updated last week
- An open-source compliance-centered evaluation framework for Generative AI models☆161Updated last week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆216Updated 8 months ago
- Training Sparse Autoencoders on Language Models☆935Updated this week
- Decoder only transformer, built from scratch with PyTorch☆31Updated last year
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.☆225Updated 3 weeks ago