gladstoneai / POWERplay
☆11 · Updated 3 years ago
Alternatives and similar repositories for POWERplay
Users who are interested in POWERplay are comparing it to the libraries listed below.
- ☆81 · Updated 2 months ago
- A text-based game where language models learn to lie and to detect lies. ☆12 · Updated 2 years ago
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments. ☆134 · Updated last week
- ☆143 · Updated 4 months ago
- ☆65 · Updated 2 years ago
- ☆259 · Updated last year
- ☆13 · Updated 2 years ago
- ☆132 · Updated last year
- ☆132 · Updated 2 years ago
- ☆85 · Updated last year
- Algebraic value editing in pretrained language models ☆66 · Updated 2 years ago
- ☆283 · Updated last year
- Aligning AI With Shared Human Values (ICLR 2021) ☆305 · Updated 2 years ago
- (Model-written) LLM evals library ☆18 · Updated last year
- Mechanistic Interpretability Visualizations using React ☆302 · Updated 11 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆115 · Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆196 · Updated last year
- ☆191 · Updated last year
- ☆195 · Updated 2 months ago
- ☆240 · Updated last year
- ☆23 · Updated last year
- ☆111 · Updated 10 months ago
- ☆109 · Updated 3 weeks ago
- [ICLR 2025] General-purpose activation steering library ☆127 · Updated 2 months ago
- ☆23 · Updated last year
- Redwood Research's transformer interpretability tools ☆14 · Updated 3 years ago
- bloom - evaluate any behavior immediately 🌸🌱 ☆28 · Updated last week
- Emergent world representations: Exploring a sequence model trained on a synthetic task ☆196 · Updated 2 years ago
- ☆25 · Updated 4 years ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆215 · Updated last week