guidelabs / infembed
Find the samples, in the test data, on which your (generative) model makes mistakes.
☆29 Updated last year
Alternatives and similar repositories for infembed
Users who are interested in infembed are comparing it to the libraries listed below.
- Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at CVPR 2024 XAI4CV Works… ☆19 Updated last year
- Steering vectors for transformer language models in Pytorch / Huggingface ☆140 Updated 11 months ago
- A fast, effective data attribution method for neural networks in PyTorch ☆229 Updated last year
- PyTorch library for Active Fine-Tuning ☆96 Updated 4 months ago
- ViT Prisma is a mechanistic interpretability library for Vision and Video Transformers (ViTs). ☆337 Updated 6 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆258 Updated last year
- Sparse Autoencoder for Mechanistic Interpretability ☆290 Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆163 Updated 7 months ago
- ☆143 Updated last month
- ☆206 Updated 3 months ago
- Erasing concepts from neural representations with provable guarantees ☆243 Updated last year
- ☆267 Updated last year
- NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Huggingface Transformers ☆42 Updated 11 months ago
- A library for efficient patching and automatic circuit discovery. ☆88 Updated last month
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆42 Updated 3 weeks ago
- ☆284 Updated last year
- ☆115 Updated 11 months ago
- AI Logging for Interpretability and Explainability🔬 ☆140 Updated last year
- Influence Functions with (Eigenvalue-corrected) Kronecker-Factored Approximate Curvature ☆178 Updated 7 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆207 Updated last year
- A toolkit for describing model features and intervening on those features to steer behavior. ☆228 Updated last month
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models". ☆110 Updated 2 years ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 Updated last year
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆241 Updated 2 weeks ago
- ☆132 Updated 2 years ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆198 Updated 11 months ago
- [ICLR 2025] General-purpose activation steering library ☆141 Updated 4 months ago
- ☆389 Updated 5 months ago
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆45 Updated last year
- This repository collects all relevant resources about interpretability in LLMs ☆391 Updated last year