zroe1 / xlab-ai-security
An online AI security course created by UChicago's XLab
☆29 · Updated last month
Alternatives and similar repositories for xlab-ai-security
Users interested in xlab-ai-security are comparing it to the libraries listed below.
- Training Sparse Autoencoders on Language Models ☆1,130 · Updated this week
- ☆854 · Updated last month
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆738 · Updated last week
- A library for mechanistic interpretability of GPT-style language models ☆2,921 · Updated 2 weeks ago
- ControlArena is a collection of settings, model organisms, and protocols for running control experiments. ☆143 · Updated last week
- A fast + lightweight implementation of the GCG algorithm in PyTorch ☆306 · Updated 7 months ago
- Sparse Autoencoder for Mechanistic Interpretability ☆285 · Updated last year
- ☆19 · Updated 8 months ago
- James' cookbook of evaluations and finetuning experiments ☆15 · Updated last week
- [NeurIPS D&B '25] The one-stop repository for LLM unlearning ☆452 · Updated this week
- Sparsify transformers with SAEs and transcoders ☆673 · Updated last week
- Stanford NLP Python library for understanding and improving PyTorch models via interventions ☆843 · Updated 2 months ago
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆813 · Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆321 · Updated 6 months ago
- Mechanistic Interpretability Visualizations using React ☆303 · Updated last year
- ☆373 · Updated 4 months ago
- Using sparse coding to find distributed representations used by neural networks. ☆289 · Updated 2 years ago
- open source interpretability platform 🧠 ☆562 · Updated last week
- Collection of evals for Inspect AI ☆313 · Updated last week
- ☆261 · Updated last year
- ☆223 · Updated last year
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL. ☆236 · Updated 4 months ago
- (Model-written) LLM evals library ☆18 · Updated last year
- This repository collects all relevant resources about interpretability in LLMs ☆389 · Updated last year
- ☆233 · Updated 3 weeks ago
- Unified access to Large Language Model modules using NNsight ☆70 · Updated last month
- A repository of 2025-2026 AI Safety and Alignment programs, camps, and workshops. ☆21 · Updated 7 months ago
- ☆192 · Updated last year
- List of papers on hallucination detection in LLMs. ☆1,008 · Updated last month
- METR Task Standard ☆169 · Updated 10 months ago