This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen, Carl Vondrick, and Chengzhi Mao.
☆56Dec 9, 2024Updated last year
Alternatives and similar repositories for selfie
Users who are interested in selfie are comparing it to the libraries listed below.
- ☆25Dec 20, 2023Updated 2 years ago
- ✒️ A gallery of experiments with Scalable Vector Graphics (SVG) and interactive visualizations.☆13Jan 6, 2023Updated 3 years ago
- ☆28Nov 16, 2025Updated 3 months ago
- Train text generation model with JavaScript.☆15Jul 14, 2024Updated last year
- Data and models for the paper "Configurable Safety Tuning of Language Models with Synthetic Preference Data"☆17Jul 27, 2024Updated last year
- Sparse Autoencoder Training Library☆55May 1, 2025Updated 10 months ago
- ☆12Oct 23, 2022Updated 3 years ago
- ☆156Dec 30, 2025Updated 2 months ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou…☆32Apr 20, 2024Updated last year
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"☆17Feb 22, 2024Updated 2 years ago
- ☆22Feb 13, 2026Updated 3 weeks ago
- ☆33Jul 9, 2025Updated 8 months ago
- ☆23Mar 11, 2025Updated 11 months ago
- ☆17Feb 14, 2024Updated 2 years ago
- DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails☆31Feb 26, 2025Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆248Feb 27, 2026Updated last week
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models☆19Aug 17, 2025Updated 6 months ago
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"☆19Dec 14, 2024Updated last year
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks☆18Apr 24, 2024Updated last year
- ☆27Nov 28, 2024Updated last year
- Applying SAEs for fine-grained control☆25Dec 15, 2024Updated last year
- ☆17Aug 1, 2025Updated 7 months ago
- Social Network Analysis and STEM Education is designed to prepare researchers to apply network analysis in order to better understand and…☆14Jul 14, 2025Updated 7 months ago
- AI-free static security scanner for Claude Code artifacts (Skills, Hooks, MCP configs). Detects data exfiltration, prompt injection, and …☆17Updated this week
- Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization☆39Feb 7, 2026Updated last month
- Official Implementation of NeurIPS'23 Paper "Cross-Episodic Curriculum for Transformer Agents"☆31Oct 12, 2023Updated 2 years ago
- ☆30Aug 2, 2024Updated last year
- Improving Steering Vectors by Targeting Sparse Autoencoder Features☆27Nov 20, 2024Updated last year
- This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce…☆25Feb 16, 2026Updated 3 weeks ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆127Mar 22, 2024Updated last year
- Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"☆27Apr 17, 2024Updated last year
- Code repo for the model organisms and convergent directions of EM papers.☆53Sep 22, 2025Updated 5 months ago
- ☆27Oct 22, 2024Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives☆70Feb 22, 2024Updated 2 years ago
- Algebraic value editing in pretrained language models☆69Nov 1, 2023Updated 2 years ago
- ☆209Oct 14, 2025Updated 4 months ago
- Official Implementation for "In-Context Reinforcement Learning from Noise Distillation"☆34Sep 18, 2024Updated last year
- QGIS plugin☆10Jan 16, 2023Updated 3 years ago
- This module is a tool for calculating correlations such as Partial, Tetrachoric, Intraclass correlation coefficients, Bootstrap agreement…☆11Feb 16, 2026Updated 3 weeks ago