Ybakman / TruthTorchLM
☆37 · Updated 2 months ago
Alternatives and similar repositories for TruthTorchLM
Users interested in TruthTorchLM are comparing it to the libraries listed below.
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆70 · Updated 8 months ago
- ☆44 · Updated 3 months ago
- AI Logging for Interpretability and Explainability 🔬 ☆123 · Updated last year
- Conformal Language Modeling ☆30 · Updated last year
- ☆172 · Updated last year
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models" ☆102 · Updated 2 years ago
- This repository collects all relevant resources about interpretability in LLMs ☆359 · Updated 7 months ago
- ☆165 · Updated 7 months ago
- ☆93 · Updated 11 months ago
- ☆101 · Updated 3 weeks ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆96 · Updated this week
- ☆18 · Updated last year
- ☆232 · Updated last year
- A resource repository for representation engineering in large language models ☆127 · Updated 7 months ago
- Source code for the NeurIPS'24 paper "HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection" ☆46 · Updated 2 months ago
- Using sparse coding to find distributed representations used by neural networks ☆256 · Updated last year
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity ☆73 · Updated 3 months ago
- ☆24 · Updated last year
- General-purpose activation steering library ☆81 · Updated last month
- Code for Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities (NeurIPS'24) ☆24 · Updated 6 months ago
- ☆136 · Updated 7 months ago
- Influence Functions with (Eigenvalue-corrected) Kronecker-Factored Approximate Curvature ☆156 · Updated this week
- Awesome SAE papers ☆36 · Updated last month
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization ☆26 · Updated 11 months ago
- Python package for measuring memorization in LLMs ☆159 · Updated 7 months ago
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆94 · Updated last year
- ☆69 · Updated 3 years ago
- Code for the paper "Aligning Large Language Models with Representation Editing: A Control Perspective" ☆32 · Updated 5 months ago
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆79 · Updated 3 months ago
- ☆95 · Updated last year