guidelabs / infembed

Find the samples, in the test data, on which your (generative) model makes mistakes.

☆28

Alternatives and similar repositories for infembed

Users who are interested in infembed are comparing it to the libraries listed below.

adamkarvonen / SAEBench
☆107 Updated this week
MadryLab / trak
A fast, effective data attribution method for neural networks in PyTorch
☆213 Updated 8 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆117 Updated 5 months ago
KihoPark / LLM_Categorical_Hierarchical_Representations
☆101 Updated 5 months ago
mlepori1 / NeuroSurgeon
NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Hugging Face Transformers
☆41 Updated 5 months ago
thestephencasper / benchmarking_interpretability
☆34 Updated last year
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆231 Updated 5 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆195 Updated last week
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆256 Updated last year
saprmarks / feature-circuits
☆181 Updated last week
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆124 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆102 Updated 3 weeks ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆206 Updated 7 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆220 Updated 9 months ago
pomonam / kronfluence
Influence Functions with (Eigenvalue-corrected) Kronecker-Factored Approximate Curvature
☆157 Updated 3 weeks ago
saprmarks / dictionary_learning
☆315 Updated last week
ArthurConmy / Automatic-Circuit-Discovery
☆231 Updated 9 months ago
maxdreyer / PURE
Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at CVPR 2024 XAI4CV Works…
☆16 Updated last year
thestephencasper / latent_adversarial_training
☆22 Updated 11 months ago
EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆590 Updated last week
centerforaisafety / Intro_to_ML_Safety
☆72 Updated 2 years ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆508 Updated last year
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆50 Updated 9 months ago
collin-burns / discovering_latent_knowledge
☆273 Updated last year
KihoPark / linear_rep_geometry
☆100 Updated 5 months ago
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆776 Updated last week
neelnanda-io / 1L-Sparse-Autoencoder
☆123 Updated last year
iamgroot42 / mimir
Python package for measuring memorization in LLMs.
☆160 Updated this week
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆165 Updated last year
IBM / activation-steering
General-purpose activation steering library
☆85 Updated 2 months ago