ArjunPanickssery / self_recognition
☆10 · Updated last year
Alternatives and similar repositories for self_recognition
Users interested in self_recognition are comparing it to the repositories listed below.
- LLM Unlearning ☆181 · Updated 2 years ago
- A resource repository for representation engineering in large language models ☆148 · Updated last year
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆93 · Updated last year
- [EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages https://arxiv.org/abs/2310.19156 ☆47 · Updated 2 years ago
- Steering Llama 2 with Contrastive Activation Addition ☆207 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆127 · Updated 11 months ago
- ☆193 · Updated 2 years ago
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆100 · Updated 2 years ago
- LoFiT: Localized Fine-tuning on LLM Representations ☆44 · Updated last year
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆55 · Updated last year
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆173 · Updated 9 months ago
- awesome SAE papers ☆71 · Updated 8 months ago
- A lightweight library for large language model (LLM) jailbreaking defense. ☆61 · Updated 4 months ago
- ☆24 · Updated last year
- [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the trustworthiness of generative foundation models. ☆125 · Updated 5 months ago
- [ACL 2024] Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models ☆60 · Updated last year
- [ICML 2025] "From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?" ☆49 · Updated 4 months ago
- A curated list of LLM interpretability material: tutorials, libraries, surveys, papers, blogs, etc. ☆295 · Updated 2 weeks ago
- Official code repository for the LM-Steer paper "Word Embeddings Are Steers for Language Models" (ACL 2024 Outstanding Paper Award) ☆136 · Updated 6 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆158 · Updated 8 months ago
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization ☆41 · Updated last year
- Official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable". ☆28 · Updated 10 months ago
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆89 · Updated 10 months ago
- [ICLR 2025] General-purpose activation steering library ☆141 · Updated 4 months ago
- Awesome LLM Self-Consistency: a curated list of Self-consistency in Large Language Models ☆119 · Updated 6 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆85 · Updated 11 months ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆98 · Updated last year
- Python package for measuring memorization in LLMs. ☆184 · Updated 6 months ago
- A curated list of resources for activation engineering ☆123 · Updated 4 months ago
- ☆56 · Updated last year