☆21 · Jun 22, 2025 · Updated 8 months ago
Alternatives and similar repositories for SHADE-Arena
Users interested in SHADE-Arena are comparing it to the libraries listed below.
- ☆35 · May 9, 2025 · Updated 9 months ago
- Open Source Replication of Anthropic's Alignment Faking Paper ☆54 · Apr 4, 2025 · Updated 10 months ago
- ☆41 · Jul 6, 2025 · Updated 7 months ago
- Code for Tangent Model Composition for Ensembling and Continual Fine-tuning (ICCV 2023) and Tangent Transformers for Composition, Privacy… ☆13 · May 14, 2024 · Updated last year
- AI Security Newsletter - A monthly digest of AI security research, insights, reports, upcoming events, and tools & resources ☆25 · Feb 5, 2026 · Updated 3 weeks ago
- ☆20 · Nov 15, 2024 · Updated last year
- ☆20 · May 25, 2024 · Updated last year
- ☆33 · Jul 9, 2025 · Updated 7 months ago
- Code to the paper: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence ☆26 · Jul 31, 2025 · Updated 7 months ago
- Official PyTorch Implementation for Meaning Representations from Trajectories in Autoregressive Models (ICLR 2024) ☆22 · May 14, 2024 · Updated last year
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments. ☆158 · Updated this week
- ☆50 · Jun 26, 2025 · Updated 8 months ago
- ☆33 · Feb 17, 2026 · Updated 2 weeks ago
- ☆27 · Oct 6, 2024 · Updated last year
- Code repo for the model organisms and convergent directions of EM papers. ☆51 · Sep 22, 2025 · Updated 5 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 · May 23, 2024 · Updated last year
- [EMNLP 2024] "Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective" ☆32 · Jul 22, 2024 · Updated last year
- ☆13 · Oct 5, 2025 · Updated 4 months ago
- Notebooks accompanying Anthropic's "Toy Models of Superposition" paper ☆137 · Sep 14, 2022 · Updated 3 years ago
- Build an AI bot in Discord to serve user's personalized reports on what's up in tech ☆28 · Sep 14, 2025 · Updated 5 months ago
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆47 · May 31, 2024 · Updated last year
- A holistic benchmark for LLM abstention ☆71 · Aug 27, 2025 · Updated 6 months ago
- [NeurIPS25] Official repo for "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning" ☆42 · Oct 3, 2025 · Updated 5 months ago
- Residual Quantization Autoencoder, used for interpreting LLMs ☆14 · Jan 1, 2025 · Updated last year
- my profile readme ☆14 · Updated this week
- ☆12 · Aug 15, 2023 · Updated 2 years ago
- ☆12 · Jul 8, 2024 · Updated last year
- The homework of robos learning base. ☆11 · May 23, 2023 · Updated 2 years ago
- Reference implementation of Thin and Deep Gaussian Processes (NeurIPS 2023) ☆14 · Nov 25, 2024 · Updated last year
- pyCEPS provides an interface to import, visualize and translate clinical mapping data ☆14 · Nov 25, 2025 · Updated 3 months ago
- Proof-carrying code completions in Dafny ☆11 · Apr 4, 2025 · Updated 10 months ago
- ☆17 · Apr 4, 2025 · Updated 10 months ago
- Code for experiments on self-prediction as a way to measure introspection in LLMs ☆16 · Dec 10, 2024 · Updated last year
- A beautiful Astro theme based on Ghost Simply theme ☆12 · Feb 20, 2026 · Updated last week
- French Jurisprudences at your fingertips @ every 72h ☆15 · Nov 18, 2025 · Updated 3 months ago
- Linear Relational Embeddings (LREs) and Linear Relational Concepts (LRCs) for LLMs in PyTorch ☆10 · Aug 7, 2024 · Updated last year
- Official repository for "DYPLOC: Dynamic Planning of Content Using Mixed Language Models for Opinion Text Generation" ☆10 · May 20, 2022 · Updated 3 years ago
- Adversarial attack and defense strategies for deep speaker recognition systems ☆42 · Feb 18, 2021 · Updated 5 years ago
- Machine Learning Reading Group ☆11 · Sep 15, 2023 · Updated 2 years ago