IParraMartin / Sparse-Autoencoder
A PyTorch implementation of a Sparse Autoencoder (SAE) trained with an MSE reconstruction loss and a KL-divergence sparsity penalty; a minimal sketch of this objective is shown below.
☆24 · Updated last year
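For orientation, this is the standard formulation that description refers to: reconstruct the input under mean-squared error, and add a KL-divergence term that pulls each hidden unit's average activation toward a small target sparsity. The sketch below is illustrative only; the dimensions, sparsity target `rho`, and penalty weight `beta` are assumptions, not values taken from this repository.

```python
# Minimal SAE sketch: MSE reconstruction loss + KL-divergence sparsity penalty.
# All hyperparameters (input_dim, hidden_dim, rho, beta) are illustrative
# assumptions, not taken from IParraMartin/Sparse-Autoencoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))  # activations kept in (0, 1)
        return self.decoder(h), h

def kl_sparsity_penalty(h, rho=0.05, eps=1e-8):
    # KL(rho || rho_hat) summed over hidden units, where rho_hat is each
    # unit's mean activation over the batch; small when units are mostly off.
    rho_hat = h.mean(dim=0).clamp(eps, 1 - eps)
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

model = SparseAutoencoder(input_dim=784, hidden_dim=256)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)  # dummy batch standing in for real data
x_hat, h = model(x)
loss = F.mse_loss(x_hat, x) + 1e-3 * kl_sparsity_penalty(h)  # beta = 1e-3
opt.zero_grad()
loss.backward()
opt.step()
```

The sigmoid keeps activations in (0, 1) so each unit's batch-mean activation can be read as a firing rate, which is what makes the Bernoulli-KL penalty well defined.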
Alternatives and similar repositories for Sparse-Autoencoder
Users interested in Sparse-Autoencoder are comparing it to the repositories listed below.
- PyTorch library for Active Fine-Tuning ☆93 · Updated last month
- Sparse and discrete interpretability tool for neural networks ☆64 · Updated last year
- Sparse Autoencoder for Mechanistic Interpretability ☆283 · Updated last year
- Official implementation of MAIA, A Multimodal Automated Interpretability Agent ☆93 · Updated 2 weeks ago
- [NeurIPS 2024] Official implementation of the paper "MambaLRP: Explaining Selective State Space Sequence Models" 🐍 ☆45 · Updated last year
- Towards Understanding the Mixture-of-Experts Layer in Deep Learning ☆32 · Updated last year
- Official codebase for "Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions" (Matrenok … ☆27 · Updated 3 months ago
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆127 · Updated 8 months ago
- Sparse Autoencoder Training Library ☆55 · Updated 6 months ago
- Flexible library for merging large language models (LLMs) via evolutionary optimization (ACL 2025 Demo). ☆91 · Updated 3 months ago
- Code for ExploreToM ☆86 · Updated 4 months ago
- Code repository for Black Mamba ☆258 · Updated last year
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆92 · Updated 11 months ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆190 · Updated 8 months ago
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆83 · Updated 11 months ago
- Official implementation of FIND (NeurIPS '23) Function Interpretation Benchmark and Automated Interpretability Agents ☆51 · Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training". ☆120 · Updated last year
- Code for "Accelerating Training with Neuron Interaction and Nowcasting Networks" [ICLR 2025]☆24Updated 3 weeks ago
- Explaining ML models using LLMs☆23Updated last year
- We study toy models of skill learning.☆31Updated 9 months ago
- [ICLR 2025] Monet: Mixture of Monosemantic Experts for Transformers☆73Updated 4 months ago
- Codebase for the paper "The Remarkable Robustness of LLMs: Stages of Inference?" ☆19 · Updated 4 months ago
- nanoGPT-like codebase for LLM training ☆110 · Updated this week
- Official repository for the ICML 2024 paper "Flora: Low-Rank Adapters Are Secretly Gradient Compressors". ☆104 · Updated last year
- Code for the NeurIPS'24 paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" ☆232 · Updated 3 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆171 · Updated 4 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆138 · Updated 4 months ago
- A single repo with all scripts and utils to train / fine-tune the Mamba model with or without FIM ☆59 · Updated last year
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆71 · Updated last year