thestephencasper/benchmarking_interpretability

☆35

Alternatives and similar repositories for benchmarking_interpretability

Users who are interested in benchmarking_interpretability are comparing it to the repositories listed below.

MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13 · Jan 26, 2025 · Updated last year
thestephencasper / latent_adversarial_training
☆24 · Jul 25, 2024 · Updated last year
danielway / nexrad-volumetric-renderer
Project exploring 3D volumetric rendering of NEXRAD radar data.
☆13 · Oct 23, 2023 · Updated 2 years ago
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 · Jul 15, 2024 · Updated 2 years ago
milesaturpin / cot-unfaithfulness
☆57 · Oct 23, 2023 · Updated 2 years ago
alif-munim / minOFT
A minimal re-implementation of orthogonal fine-tuning (OFT), a diffusion method, for LLMs. Based on nanoGPT and minLoRA.
☆14 · Nov 17, 2023 · Updated 2 years ago
neelnanda-io / Neuroscope
Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons
☆14 · Feb 13, 2023 · Updated 3 years ago
UlisseMini / ana
The AI that helps you achieve your goals
☆11 · Feb 4, 2024 · Updated 2 years ago
avalonstrel / Mitigating-the-Alignment-Tax-of-RLHF
☆16 · Feb 8, 2024 · Updated 2 years ago
Chillee / lit-llama
Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code
☆10 · Aug 29, 2023 · Updated 2 years ago
johanhelsing / bevy_touch_stick
An analog touch screen joystick that pretends to be a bevy gamepad
☆13 · Jul 13, 2024 · Updated 2 years ago
eburghar / l3charts
Customizable charts made with TikZ and LaTeX3
☆14 · Feb 11, 2023 · Updated 3 years ago
understanding-search / structured-representations-maze-transformers
see github.com/understanding-search/maze-transformer
☆10 · Dec 8, 2023 · Updated 2 years ago
ejnnr / cupbearer
A library for mechanistic anomaly detection
☆22 · Jan 9, 2025 · Updated last year
qrdl / flightrec
Flight Recorder allows to record client program execution and examine it later
☆11 · Sep 18, 2020 · Updated 5 years ago
raventan95 / echo-view-classifier
☆13 · Apr 7, 2024 · Updated 2 years ago
varkor / DISTORT
A small game demonstrating a grid distortion effect
☆15 · Oct 5, 2021 · Updated 4 years ago
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆119 · Jun 13, 2024 · Updated 2 years ago
TBosak / ability
Ability is a browser extension that helps people with varying degrees of ability have more control over their browsing experience.
☆22 · Sep 24, 2025 · Updated 9 months ago
EleutherAI / deep-ignorance
☆20 · Jan 7, 2026 · Updated 6 months ago
zmitchell / polsim
A command line utility for doing polarization simulations
☆17 · Aug 21, 2019 · Updated 6 years ago
GDPlumb / ExpO
Explanation Optimization
☆13 · Oct 16, 2020 · Updated 5 years ago
VectorInstitute / privacy-enhancing-techniques
A collection of demos and utilities prepared ahead of the Vector Institute Privacy Enhancing Techniques (PETs) Bootcamp.
☆15 · Sep 22, 2022 · Updated 3 years ago
DYR1 / MoGU
Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.
☆18 · Jan 14, 2025 · Updated last year
princeton-nlp / benign-data-breaks-safety
☆47 · Oct 1, 2024 · Updated last year
aengusl / latent-adversarial-training
☆48 · Sep 29, 2024 · Updated last year
E1-CarpoolProject / carpool-multiagents
Implementation of a multi-agent system for the modeling of carpooling in a city with one-way streets. Used Python and the Mesa package fo…
☆14 · Jan 19, 2022 · Updated 4 years ago
vfleaking / PTST
Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆22 · Sep 21, 2025 · Updated 10 months ago
HCDM / XRec
Models for explainable recommendation.
☆12 · Jan 19, 2024 · Updated 2 years ago
y0mingzhang / diffuse-distributions
Forcing Diffuse Distributions out of Language Models
☆18 · Sep 10, 2024 · Updated last year
ethz-spylab / jailbreak-tax
☆24 · Feb 17, 2026 · Updated 5 months ago
serre-lab / Horama
☆18 · May 1, 2025 · Updated last year
gfx-rs / nv-flip-rs
Bindings to Nvidia Labs's ꟻLIP image comparison and error visualization library
☆22 · Updated this week
dbeley / reddit_export_userdata
Export userdata from your reddit accounts. Submissions, comments, saved, upvoted contents are supported.
☆23 · Jul 8, 2026 · Updated last week
illidanlab / inversion-influence-function
Official codes for "Understanding Deep Gradient Leakage via Inversion Influence Functions", NeurIPS 2023
☆16 · Oct 13, 2023 · Updated 2 years ago
drewolbrich / VolumeWindowZoom
A visionOS project that demonstrates how to scale a volume to account for Window Zoom changes
☆18 · Apr 3, 2024 · Updated 2 years ago
socketteer / worldspider
gpt completions in vscode
☆35 · Mar 24, 2023 · Updated 3 years ago
functorism / gpt4-tokenizer-visualizer
GPT4 Tokenizer Visualizer
☆23 · May 21, 2023 · Updated 3 years ago
anthropics / hypercorn
Hypercorn is an ASGI and WSGI Server based on Hyper libraries and inspired by Gunicorn.
☆21 · Jan 12, 2026 · Updated 6 months ago