☆35 Feb 20, 2025 Updated last year
Alternatives and similar repositories for behavioral-self-awareness
Users who are interested in behavioral-self-awareness are comparing it to the libraries listed below
- Code for the paper "Distinguishing the Knowable from the Unknowable with Language Models" ☆11 Apr 15, 2024 Updated last year
- An implementation of MSSRM method ☆11 Mar 23, 2023 Updated 2 years ago
- ☆26 Sep 3, 2025 Updated 6 months ago
- Code repo for the model organisms and convergent directions of EM papers. ☆53 Sep 22, 2025 Updated 5 months ago
- ☆18 Apr 7, 2025 Updated 11 months ago
- Reimplementation of https://github.com/montemac/algebraic_value_editing in pure PyTorch for efficiency on large models ☆11 Jun 28, 2023 Updated 2 years ago
- ☆20 Nov 15, 2024 Updated last year
- KernelBench v2: Can LLMs Write GPU Kernels? - Benchmark with Torch -> Triton (and more!) problems ☆22 Jul 4, 2025 Updated 8 months ago
- An original implementation of the paper "CREPE: Open-Domain Question Answering with False Presuppositions" ☆16 Nov 5, 2024 Updated last year
- Distribution Preserving Backdoor Attack in Self-supervised Learning ☆20 Jan 27, 2024 Updated 2 years ago
- ☆16 Jul 23, 2024 Updated last year
- ☆32 Mar 12, 2025 Updated 11 months ago
- Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders" ☆23 Feb 6, 2025 Updated last year
- minimal Energy-based transformer ☆43 Dec 11, 2025 Updated 2 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆258 Sep 24, 2024 Updated last year
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆27 Nov 20, 2024 Updated last year
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆32 Jan 23, 2025 Updated last year
- Multi-Layer Sparse Autoencoders (ICLR 2025) ☆29 Feb 6, 2026 Updated last month
- ☆48 May 27, 2025 Updated 9 months ago
- ☆27 Mar 13, 2024 Updated last year
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks ☆32 Jul 9, 2024 Updated last year
- ☆13 Oct 5, 2025 Updated 5 months ago
- Vstream - Video Analytics pipeline with Hardware based accelerations (dev - stage) ☆10 Feb 2, 2024 Updated 2 years ago
- ☆36 May 21, 2025 Updated 9 months ago
- [ICLR 2025] Monet: Mixture of Monosemantic Experts for Transformers ☆75 Jun 23, 2025 Updated 8 months ago
- A novel approach to improve the safety of large language models, enabling them to transition effectively from unsafe to safe state. ☆72 May 22, 2025 Updated 9 months ago
- Measuring the situational awareness of language models ☆40 Feb 12, 2024 Updated 2 years ago
- Code to break Llama Guard ☆32 Dec 7, 2023 Updated 2 years ago
- Build an AI bot in Discord to serve user's personalized reports on what's up in tech ☆28 Sep 14, 2025 Updated 5 months ago
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments. ☆160 Feb 27, 2026 Updated last week
- Library on Arduino to add over the air (OTA) Update Capabilities to bw16/rtl8720DN ☆11 Aug 6, 2024 Updated last year
- Precision Knowledge Editing (PKE): A novel method to reduce toxicity in LLMs while preserving performance, with robust evaluations and ha… ☆11 Nov 26, 2024 Updated last year
- ☆44 Feb 9, 2026 Updated last month
- ☆24 Feb 18, 2026 Updated 2 weeks ago
- ☆12 Dec 12, 2019 Updated 6 years ago
- ☆10 Sep 15, 2024 Updated last year
- ☆14 May 1, 2023 Updated 2 years ago
- Trains small LMs. Designed for training on SimpleStories ☆12 Sep 15, 2025 Updated 5 months ago
- Conceptual Construct Representations ☆11 Feb 23, 2023 Updated 3 years ago