safety-research / persona_vectors
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
☆356Jul 30, 2025Updated 6 months ago
Alternatives and similar repositories for persona_vectors
Users who are interested in persona_vectors compare it to the libraries listed below.
- ☆21Jun 22, 2025Updated 7 months ago
- ☆49Jun 26, 2025Updated 7 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆165Jun 25, 2025Updated 7 months ago
- Code repo for the model organisms and convergent directions of EM papers.☆49Sep 22, 2025Updated 4 months ago
- ☆263Jan 12, 2026Updated last month
- ☆28Nov 16, 2025Updated 3 months ago
- ☆34Feb 20, 2025Updated 11 months ago
- [NeurIPS25] Official repo for "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning"☆41Oct 3, 2025Updated 4 months ago
- ☆36Apr 30, 2024Updated last year
- An active inference model of Lacanian psychoanalysis☆15Jun 7, 2025Updated 8 months ago
- MLFlow End to End Workshop at Chandigarh University☆11Feb 3, 2023Updated 3 years ago
- Learning to Skip the Middle Layers of Transformers☆17Aug 7, 2025Updated 6 months ago
- Implementation of Reinforce for educational purposes.☆12Jun 12, 2023Updated 2 years ago
- [NeurIPS D&B '25] The one-stop repository for LLM unlearning☆479Dec 24, 2025Updated last month
- Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers, Paper accepted at eXCV workshop of ECCV 2…☆30Jan 6, 2025Updated last year
- [ACL 2025] LongSafety: Evaluating Long-Context Safety of Large Language Models☆15Jun 18, 2025Updated 7 months ago
- ☆18Jan 5, 2026Updated last month
- Deep Learning Type Library☆37Feb 8, 2026Updated last week
- Code Repository for Blog - How to Productionize Large Language Models (LLMs)☆12Mar 27, 2024Updated last year
- [USENIX Security 2025] SOFT: Selective Data Obfuscation for Protecting LLM Fine-tuning against Membership Inference Attacks☆19Sep 18, 2025Updated 4 months ago
- Building reliable Retrieval Augmented Generation(RAG) AI Architecture☆13Jul 30, 2024Updated last year
- This is the implementation for IEEE S&P 2022 paper "Model Orthogonalization: Class Distance Hardening in Neural Networks for Better Secur…☆11Aug 24, 2022Updated 3 years ago
- AI Security Newsletter - A monthly digest of AI security research, insights, reports, upcoming events, and tools & resources☆23Feb 5, 2026Updated last week
- End-to-end codebase for finetuning LLMs (LLaMA 2, 3, etc.) with or without DP☆16Sep 23, 2024Updated last year
- ☆17Aug 30, 2025Updated 5 months ago
- ☆20Nov 15, 2024Updated last year
- Attribution-based Parameter Decomposition☆33Jun 11, 2025Updated 8 months ago
- Improving Alignment and Robustness with Circuit Breakers☆258Sep 24, 2024Updated last year
- [NAACL 2025] Towards Rationality in Language and Multimodal Agents: A Survey☆35Feb 19, 2025Updated 11 months ago
- Röttger et al. (2025): "MSTS: A Multimodal Safety Test Suite for Vision-Language Models"☆16Mar 31, 2025Updated 10 months ago
- ☆20May 25, 2024Updated last year
- Implementation of the paper "ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability"☆60Jun 3, 2025Updated 8 months ago
- SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning☆175Sep 18, 2025Updated 4 months ago
- Red Queen Dataset and data generation template☆25Dec 26, 2025Updated last month
- Agent Watch is an AgentOps monitoring library designed for Crew AI applications.☆21Dec 2, 2024Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".☆342Jun 13, 2025Updated 8 months ago
- [COLM 2024] JailBreakV-28K: A comprehensive benchmark designed to evaluate the transferability of LLM jailbreak attacks to MLLMs, and fur…☆85May 9, 2025Updated 9 months ago
- Open source interpretability artefacts for R1.☆170Apr 21, 2025Updated 9 months ago
- Pydantic AI agent that implements the idea of Claude Skills (progressive disclosure) with no reliance on Claude itself.☆64Jan 27, 2026Updated 3 weeks ago