jkutaso / SHADE-Arena
☆20Updated 5 months ago
Alternatives and similar repositories for SHADE-Arena
Users who are interested in SHADE-Arena are comparing it to the repositories listed below
- ☆19Updated 4 months ago
 - This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…☆28Updated last week
 - Align your LM to express calibrated verbal statements of confidence in its long-form generations.☆27Updated last year
 - A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.☆83Updated 7 months ago
 - Code repo for the model organisms and convergent directions of EM papers.☆36Updated last month
 - ☆49Updated 2 years ago
 - [ICLR 2025] General-purpose activation steering library☆115Updated last month
 - Evaluate interpretability methods on localizing and disentangling concepts in LLMs.☆56Updated this week
 - ☆92Updated last year
 - Teaching Models to Express Their Uncertainty in Words☆39Updated 3 years ago
 - ☆59Updated 2 years ago
 - ☆16Updated last year
 - A library for efficient patching and automatic circuit discovery.☆78Updated 3 months ago
 - Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…☆12Updated 9 months ago
 - ☆17Updated last year
 - ☆46Updated last year
 - Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆136Updated 4 months ago
 - This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca…☆61Updated 2 years ago
 - This is the official repo for Towards Uncertainty-Aware Language Agent.☆29Updated last year
 - Code for experiments on self-prediction as a way to measure introspection in LLMs☆16Updated 10 months ago
 - Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State☆20Updated last week
 - Code for "Tracing Knowledge in Language Models Back to the Training Data"☆39Updated 2 years ago
 - This repository contains data, code and models for contextual noncompliance.☆24Updated last year
 - Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆18Updated last year
 - The accompanying code for "Transformer Feed-Forward Layers Are Key-Value Memories". Mor Geva, Roei Schuster, Jonathan Berant, and Omer Le…☆97Updated 4 years ago
 - Landing page for MIB: A Mechanistic Interpretability Benchmark☆20Updated 2 months ago
 - ☆36Updated 2 years ago
 - EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue☆37Updated 5 months ago
 - Algebraic value editing in pretrained language models☆66Updated 2 years ago
 - LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces☆98Updated 2 years ago