fiveai / understanding_safety_finetuning
Official Code for What Makes and Breaks Safety Fine-tuning? A Mechanistic Study (NeurIPS 2024)
☆12 Updated last year
Alternatives and similar repositories for understanding_safety_finetuning
Users who are interested in understanding_safety_finetuning are comparing it to the repositories listed below
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆41 Updated last year
- ☆37 Updated last year
- Code repo for the model organisms and convergent directions of EM papers. ☆41 Updated 3 months ago
- ☆51 Updated 2 years ago
- What do we learn from inverting CLIP models? ☆57 Updated last year
- An official implementation of "Catastrophic Failure of LLM Unlearning via Quantization" (ICLR 2025) ☆35 Updated 10 months ago
- ☆16 Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆161 Updated 6 months ago
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models". ☆108 Updated 2 years ago
- ☆112 Updated 10 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆65 Updated 6 months ago
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆57 Updated 2 months ago
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆79 Updated last year
- Trains Sparse Autoencoders based on outputs from language models ☆11 Updated last year
- [ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral) ☆85 Updated last year
- ☆24 Updated last year
- NeurIPS'24 - LLM Safety Landscape ☆37 Updated 2 months ago
- [ICLR 2025] General-purpose activation steering library ☆132 Updated 3 months ago
- Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025) ☆30 Updated 3 months ago
- Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024) ☆35 Updated last year
- ☆69 Updated last year
- ☆59 Updated 2 years ago
- Tools for optimizing steering vectors in LLMs. ☆15 Updated 8 months ago
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆83 Updated last year
- Official repository for our paper, Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Mode… ☆20 Updated last year
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p… ☆12 Updated 11 months ago
- ☆56 Updated 11 months ago
- ☆31 Updated 2 years ago
- Function Vectors in Large Language Models (ICLR 2024) ☆189 Updated 8 months ago
- `dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms. ☆98 Updated last week