apartresearch / readingwhatwecan
📚📚📚📚📚📚📚📚📚 Reading everything
☆12 Updated 6 months ago
Related projects
Alternatives and complementary repositories for readingwhatwecan
- ☆24 Updated 7 months ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆186 Updated this week
- Mechanistic Interpretability for Transformer Models ☆49 Updated 2 years ago
- ☆20 Updated 3 years ago
- PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind) ☆13 Updated 6 months ago
- ☆73 Updated 4 months ago
- Measuring the situational awareness of language models ☆33 Updated 8 months ago
- Erasing concepts from neural representations with provable guarantees ☆208 Updated 3 weeks ago
- ☆122 Updated last week
- Probabilistic LLM evaluations. [CogSci2023; ACL2023] ☆73 Updated 3 months ago
- ☆44 Updated last month
- 🧠 Starter templates for doing interpretability research ☆63 Updated last year
- Sparse and discrete interpretability tool for neural networks ☆53 Updated 8 months ago
- Repository for the code of the "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding" paper, NAACL'22 ☆64 Updated 2 years ago
- we got you bro ☆32 Updated 3 months ago
- ☆99 Updated this week
- A dataset of alignment research and code to reproduce it ☆68 Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆157 Updated last month
- ☆239 Updated 4 months ago
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆62 Updated last year
- Mechanistic Interpretability Visualizations using React ☆195 Updated 3 months ago
- A toolkit for describing model features and intervening on those features to steer behavior. ☆69 Updated this week
- ☆65 Updated last week
- ☆17 Updated 9 months ago
- Utilities for the HuggingFace transformers library ☆61 Updated last year
- ☆102 Updated last month
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆84 Updated 8 months ago
- ☆73 Updated last month
- Language-annotated Abstraction and Reasoning Corpus ☆78 Updated last year
- Materials for ConceptARC paper ☆76 Updated this week