safety-research / persona_vectors

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

☆284

Alternatives and similar repositories for persona_vectors

Users who are interested in persona_vectors are comparing it to the repositories listed below.

andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆299 Updated 5 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆141 Updated 4 months ago
tonychenxyz / selfie
This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,…
☆53 Updated 11 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆242 Updated last year
OpenMOSS / Language-Model-SAEs
Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.
☆163 Updated this week
suzgunmirac / dynamic-cheatsheet
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
☆195 Updated 6 months ago
zjunlp / KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers
☆159 Updated last week
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆163 Updated 7 months ago
SALT-NLP / collaborative-gym
Framework and toolkits for building and evaluating collaborative agents that can work together with humans.
☆108 Updated 3 weeks ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆121 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆129 Updated 9 months ago
facebookresearch / sweet_rl
Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
☆250 Updated 6 months ago
emergent-misalignment / emergent-misalignment
☆226 Updated 3 weeks ago
openai / safety-rbr-code-and-data
Code and example data for the paper: Rule Based Rewards for Language Model Safety
☆202 Updated last year
declare-lab / trust-align
Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref…
☆68 Updated 8 months ago
sunblaze-ucb / Intuitor
Code for the paper: "Learning to Reason without External Rewards"
☆375 Updated 4 months ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆214 Updated last year
vsubramaniam851 / multiagent-ft
☆222 Updated 8 months ago
multi-agent-systems-failure-taxonomy / MAST
☆293 Updated 4 months ago
shengliu66 / ICV
Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
☆192 Updated 9 months ago
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆172 Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆119 Updated 2 months ago
DataArcTech / LLM-as-a-Judge
☆154 Updated last month
knoveleng / open-rs
Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't"
☆268 Updated last month
LeonGuertler / TextArena
A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
☆311 Updated 3 weeks ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆193 Updated last year
facebookresearch / meta-agents-research-environments
Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat…
☆364 Updated last week
ypwang61 / One-Shot-RLVR
[NeurIPS 2025] Reinforcement Learning for Reasoning in Large Language Models with One Training Example
☆376 Updated last month
zorazrw / agent-workflow-memory
AWM: Agent Workflow Memory
☆355 Updated 9 months ago
jacobdunefsky / transcoder_circuits
☆188 Updated last year