This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen, Carl Vondrick, and Chengzhi Mao.
☆56Dec 9, 2024Updated last year
Alternatives and similar repositories for selfie
Users that are interested in selfie are comparing it to the libraries listed below.
- ☆23Mar 11, 2025Updated last year
- ☆25Dec 20, 2023Updated 2 years ago
- ☆30Nov 16, 2025Updated 4 months ago
- ☆158Dec 30, 2025Updated 3 months ago
- Playing around with various jailbreaking techniques ahead of the Gray Swan AI Ultimate Jailbreaking Competition☆18Oct 6, 2024Updated last year
- Code for the paper "Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM"☆14Nov 17, 2023Updated 2 years ago
- Sparse Autoencoder Training Library☆55May 1, 2025Updated 11 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆252Feb 27, 2026Updated last month
- [S&P 2026] SoK: Evaluating Jailbreak Guardrails for Large Language Models☆37Dec 17, 2025Updated 3 months ago
- ☆30Aug 2, 2024Updated last year
- ☆28Nov 28, 2024Updated last year
- Attribute statements generated by LLMs to preceding tokens using attention weights.☆24Apr 22, 2025Updated 11 months ago
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…☆13Jan 26, 2025Updated last year
- ✒️ A gallery of experiments with Scalable Vector Graphics (SVG) and interactive visualizations.☆13Jan 6, 2023Updated 3 years ago
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models☆19Aug 17, 2025Updated 7 months ago
- [ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"☆14Jun 21, 2024Updated last year
- Using sparse coding to find distributed representations used by neural networks.☆298Nov 10, 2023Updated 2 years ago
- ☆17Feb 14, 2024Updated 2 years ago
- Data and models for the paper "Configurable Safety Tuning of Language Models with Synthetic Preference Data"☆17Jul 27, 2024Updated last year
- ☆36Sep 28, 2025Updated 6 months ago
- Code for the NAACL 2024 HCI+NLP Workshop paper "LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tool…☆13Mar 24, 2024Updated 2 years ago
- ☆12Apr 25, 2025Updated 11 months ago
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"☆17Feb 22, 2024Updated 2 years ago
- ☆22Feb 13, 2026Updated last month
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"☆20Dec 14, 2024Updated last year
- Official repository of paper "LOVE-R1: Advancing Long Video Understanding with Adaptive Zoom-in Mechanism via Multi-Step Reasoning"☆23Nov 1, 2025Updated 5 months ago
- [NeurIPS XAIA & Springer] Code and notebooks to paper "A Fresh Look at Sanity Checks for Saliency Maps"☆25Jul 12, 2024Updated last year
- Code for the paper "Representing Spatial Trajectories as Distributions"☆13Jan 17, 2023Updated 3 years ago
- ☆20Feb 8, 2024Updated 2 years ago
- ☆16Jul 23, 2024Updated last year
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs.☆58Oct 30, 2025Updated 5 months ago
- EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue☆38May 26, 2025Updated 10 months ago
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model☆574Jan 28, 2025Updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆129Mar 22, 2024Updated 2 years ago
- ☆12Oct 7, 2024Updated last year
- Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals☆12May 24, 2024Updated last year
- Interpreting the latent space representations of attention head outputs for LLMs☆39Aug 13, 2024Updated last year
- ☆16May 1, 2025Updated 11 months ago
- Algebraic value editing in pretrained language models☆70Nov 1, 2023Updated 2 years ago