CentreSecuriteIA / BELLS
Benchmarks for the Evaluation of LLM Supervision
☆32 · Updated 3 weeks ago
Alternatives and similar repositories for BELLS:
Users who are interested in BELLS are comparing it to the libraries listed below.
- ControlArena is a suite of realistic settings, mimicking complex deployment environments, for running control evaluations. This is an alp… ☆50 · Updated this week
- Mechanistic Interpretability Visualizations using React ☆241 · Updated 4 months ago
- METR Task Standard ☆146 · Updated 2 months ago
- Collection of evals for Inspect AI ☆117 · Updated this week
- This repository collects all relevant resources about interpretability in LLMs ☆341 · Updated 5 months ago
- 📚 A curated list of papers & technical articles on AI Quality & Safety ☆178 · Updated 2 weeks ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆98 · Updated 2 months ago
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL. ☆210 · Updated last year
- ☆71 · Updated 2 months ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆102 · Updated last year
- ☆42 · Updated 8 months ago
- ☆68 · Updated last year
- Tools for studying developmental interpretability in neural networks. ☆89 · Updated 3 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆90 · Updated this week
- ☆219 · Updated 6 months ago
- 🧠 Starter templates for doing interpretability research ☆70 · Updated last year
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆284 · Updated 7 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆488 · Updated 10 months ago
- Inspect: A framework for large language model evaluations ☆903 · Updated this week
- Improving Alignment and Robustness with Circuit Breakers ☆197 · Updated 7 months ago
- A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents ☆136 · Updated this week
- Sparse Autoencoder for Mechanistic Interpretability ☆241 · Updated 9 months ago
- ☆128 · Updated 3 weeks ago
- Machine Learning for Alignment Bootcamp ☆25 · Updated last year
- Erasing concepts from neural representations with provable guarantees ☆227 · Updated 3 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research) ☆197 · Updated 4 months ago
- Notebooks accompanying Anthropic's "Toy Models of Superposition" paper ☆120 · Updated 2 years ago
- A toolkit for describing model features and intervening on those features to steer behavior ☆178 · Updated 5 months ago
- Papers about red teaming LLMs and multimodal models ☆111 · Updated 5 months ago
- ☆91 · Updated 2 weeks ago