science-of-finetuning / sparsity-artifacts-crosscoders

Code for the "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning" paper.

☆11

Alternatives and similar repositories for sparsity-artifacts-crosscoders

Users who are interested in sparsity-artifacts-crosscoders are comparing it to the repositories listed below.

Nix07 / finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…
☆27 Updated last year
ethancaballero / broken_neural_scaling_laws
Code Release for "Broken Neural Scaling Laws" (BNSL) paper
☆59 Updated last year
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆53 Updated 2 months ago
adamkarvonen / SAE_BoardGameEval
☆23 Updated 5 months ago
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆75 Updated 8 months ago
janphilippfranken / sami
Self-Supervised Alignment with Mutual Information
☆20 Updated last year
abhishekpanigrahi1996 / transformer_in_transformer
☆45 Updated last year
RobertCsordas / ndr
The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization".
☆33 Updated last month
gregorbachmann / Next-Token-Failures
☆87 Updated last year
jiahai-feng / binding-iclr
☆14 Updated last year
formll / resolving-scaling-law-discrepancies
☆20 Updated last year
tml-epfl / why-weight-decay
Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024]
☆66 Updated 9 months ago
ImKeTT / AdaVAE
[Preprint] AdaVAE: Exploring Adaptive GPT-2s in VAEs for Language Modeling PyTorch Implementation
☆35 Updated last year
rtaori / data_feedback
Code for the paper "Data Feedback Loops: Model-driven Amplification of Dataset Biases"
☆16 Updated 2 years ago
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆80 Updated last year
IdoAmos / not-from-scratch
☆32 Updated 8 months ago
JeanKaddour / LAWA
Latest Weight Averaging (NeurIPS HITY 2022)
☆30 Updated 2 years ago
bilal-chughtai / rep-theory-mech-interp
☆26 Updated 2 years ago
MadryLab / datamodels-data
Data for "Datamodels: Predicting Predictions with Training Data"
☆97 Updated 2 years ago
taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆63 Updated last year
mcleish7 / gemstone-scaling-laws
☆27 Updated 5 months ago
shoaibahmed / metadata_archaeology
Official code for the paper: "Metadata Archaeology"
☆19 Updated 2 years ago
yikangshen / megablocks
☆20 Updated last year
AhmedImtiazPrio / grok-adversarial
Deep Networks Grok All the Time and Here is Why
☆37 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆70 Updated 2 months ago
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆18 Updated 5 months ago
KihoPark / linear_rep_geometry
☆100 Updated 5 months ago
pietrolesci / memorisation-profiles
This is the official implementation for our ACL 2024 paper: "Causal Estimation of Memorisation Profiles".
☆23 Updated 3 months ago
Doraemonzzz / tnn-pytorch
☆20 Updated 2 years ago
google-deepmind / emergent_in_context_learning
☆84 Updated 11 months ago