guidelabs / infembed
Find the samples, in the test data, on which your (generative) model makes mistakes.
☆26 Updated 8 months ago
Alternatives and similar repositories for infembed
Users who are interested in infembed are comparing it to the libraries listed below.
- A fast, effective data attribution method for neural networks in PyTorch ☆211 Updated 7 months ago
- ControlArena is a suite of realistic settings, mimicking complex deployment environments, for running control evaluations. This is an alp… ☆69 Updated this week
- Steering vectors for transformer language models in PyTorch / Huggingface ☆108 Updated 4 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆113 Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆95 Updated 3 weeks ago
- AI Logging for Interpretability and Explainability🔬 ☆123 Updated last year
- ☆22 Updated 11 months ago
- ☆29 Updated 2 years ago
- Data for "Datamodels: Predicting Predictions with Training Data" ☆97 Updated 2 years ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 Updated last year
- ☆36 Updated 2 years ago
- Red-Teaming Language Models with DSPy ☆198 Updated 4 months ago
- ☆45 Updated 10 months ago
- ☆99 Updated 4 months ago
- Sphynx Hallucination Induction ☆54 Updated 4 months ago
- Collection of evals for Inspect AI ☆167 Updated this week
- ☆12 Updated 2 years ago
- Influence Functions with (Eigenvalue-corrected) Kronecker-Factored Approximate Curvature ☆156 Updated last week
- ☆95 Updated 4 months ago
- Erasing concepts from neural representations with provable guarantees ☆228 Updated 5 months ago
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆47 Updated 8 months ago
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆70 Updated 8 months ago
- A toolkit for describing model features and intervening on those features to steer behavior. ☆190 Updated 7 months ago
- ☆101 Updated 3 weeks ago
- ☆44 Updated last year
- ☆54 Updated 2 years ago
- Private Evolution: Generating DP Synthetic Data without Training [ICLR 2024, ICML 2024 Spotlight] ☆97 Updated 3 weeks ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆108 Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆202 Updated 6 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆127 Updated 3 weeks ago