safety-research / persona_vectors
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
☆322Updated 5 months ago
Alternatives and similar repositories for persona_vectors
Users who are interested in persona_vectors compare it to the libraries listed below.
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".☆324Updated 6 months ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers☆162Updated last month
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆161Updated 6 months ago
- Improving Alignment and Robustness with Circuit Breakers☆252Updated last year
- Code for the paper: "Learning to Reason without External Rewards"☆385Updated 6 months ago
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans.☆117Updated last month
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks☆254Updated 8 months ago
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat…☆411Updated last month
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆124Updated last year
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory☆231Updated 7 months ago
- ☆317Updated 5 months ago
- Open source interpretability artefacts for R1.☆165Updated 8 months ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering☆196Updated 10 months ago
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,…☆55Updated last year
- [NeurIPS 2025] Reinforcement Learning for Reasoning in Large Language Models with One Training Example☆392Updated last month
- Steering vectors for transformer language models in Pytorch / Huggingface☆137Updated 10 months ago
- ☆166Updated 2 months ago
- Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't"☆271Updated 2 months ago
- ☆226Updated 10 months ago
- AWM: Agent Workflow Memory☆376Updated 2 weeks ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning☆334Updated 2 months ago
- ☆193Updated last year
- A benchmark list for evaluation of large language models.☆154Updated 4 months ago
- 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource…☆351Updated last month
- ☆202Updated 8 months ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety☆204Updated last year
- Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.☆168Updated this week
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use☆176Updated last year
- ⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper.☆101Updated 2 months ago
- Official implementation of the NeurIPS 2025 paper "Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space"☆295Updated 3 weeks ago