XuchanBao / behavioral-self-awareness
☆34 Feb 20, 2025 Updated 11 months ago
Alternatives and similar repositories for behavioral-self-awareness
Users who are interested in behavioral-self-awareness are comparing it to the libraries listed below
- An implementation of MSSRM method ☆11 Mar 23, 2023 Updated 2 years ago
- Code repo for the model organisms and convergent directions of EM papers. ☆49 Sep 22, 2025 Updated 4 months ago
- ☆16 Apr 7, 2025 Updated 10 months ago
- A tiny easily hackable implementation of a feature dashboard. ☆15 Oct 21, 2025 Updated 3 months ago
- Reimplementation of https://github.com/montemac/algebraic_value_editing in pure PyTorch for efficiency on large models ☆11 Jun 28, 2023 Updated 2 years ago
- ☆20 Nov 15, 2024 Updated last year
- KernelBench v2: Can LLMs Write GPU Kernels? - Benchmark with Torch -> Triton (and more!) problems ☆21 Jul 4, 2025 Updated 7 months ago
- An original implementation of the paper "CREPE: Open-Domain Question Answering with False Presuppositions" ☆16 Nov 5, 2024 Updated last year
- Distribution Preserving Backdoor Attack in Self-supervised Learning ☆20 Jan 27, 2024 Updated 2 years ago
- ☆16 Jul 23, 2024 Updated last year
- Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders" ☆23 Feb 6, 2025 Updated last year
- ☆263 Jan 12, 2026 Updated last month
- ☆49 Jun 26, 2025 Updated 7 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆258 Sep 24, 2024 Updated last year
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆27 Nov 20, 2024 Updated last year
- ☆47 May 27, 2025 Updated 8 months ago
- Multi-Layer Sparse Autoencoders (ICLR 2025) ☆29 Feb 6, 2026 Updated last week
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆32 Jan 23, 2025 Updated last year
- ☆27 Mar 13, 2024 Updated last year
- ☆35 May 21, 2025 Updated 8 months ago
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks ☆32 Jul 9, 2024 Updated last year
- [ICLR 2025] Monet: Mixture of Monosemantic Experts for Transformers ☆75 Jun 23, 2025 Updated 7 months ago
- Official code of paper "Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models" ☆86 May 27, 2025 Updated 8 months ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLM ☆84 Nov 3, 2024 Updated last year
- A toolbox with the goal of speeding up research on bargaining in MARL (cooperation problems in MARL). ☆32 Sep 29, 2022 Updated 3 years ago
- A novel approach to improve the safety of large language models, enabling them to transition effectively from unsafe to safe state. ☆71 May 22, 2025 Updated 8 months ago
- Measuring the situational awareness of language models ☆40 Feb 12, 2024 Updated 2 years ago
- Code to break Llama Guard ☆32 Dec 7, 2023 Updated 2 years ago
- ☆39 Feb 9, 2026 Updated last week
- A human-in-the-loop advertising copy generation application implemented with Streamlit and LangGraph ☆11 Feb 15, 2025 Updated last year
- Build an AI bot in Discord to serve user's personalized reports on what's up in tech ☆28 Sep 14, 2025 Updated 5 months ago
- The public web API of the National Museum of Australia ☆11 Sep 12, 2023 Updated 2 years ago
- ☆36 Sep 6, 2024 Updated last year
- Official implementation of the WASP web agent security benchmark ☆67 Aug 12, 2025 Updated 6 months ago
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments. ☆153 Updated this week
- Conceptual Construct Representations ☆11 Feb 23, 2023 Updated 2 years ago
- Entry for the first Hunan Province Artificial Intelligence Competition (2020) ☆11 Feb 17, 2022 Updated 4 years ago
- Results of my master's thesis. Conditional invertible neural networks in the freia framework were used to determine the CO2 concentration … ☆10 Jan 12, 2020 Updated 6 years ago
- ☆10 Sep 15, 2024 Updated last year