Aleph-Alpha / AtMan

☆29

Alternatives and similar repositories for AtMan:

Users interested in AtMan are comparing it to the repositories listed below.

anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆98 Updated last year
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆186 Updated 4 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆134 Updated 10 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆91 Updated last month
tcapelle / llm_recipes
A set of scripts and notebooks on LLM fine-tuning and dataset creation
☆105 Updated 6 months ago
google-deepmind / dangerous-capability-evaluations
☆53 Updated 6 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆192 Updated 6 months ago
bertiev / SimpleSafetyTests
☆17 Updated last year
tomekkorbak / pretraining-with-human-feedback
Code accompanying the paper Pretraining Language Models with Human Preferences
☆180 Updated last year
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆84 Updated 6 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆91 Updated last year
kaistAI / FLASK
[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
☆215 Updated last year
bjoernpl / lm-evaluation-harness-de
A framework for few-shot evaluation of autoregressive language models.
☆13 Updated last year
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆148 Updated 11 months ago
shengliu66 / ICV
Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
☆167 Updated last month
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆233 Updated 4 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆103 Updated last year
jacobdunefsky / transcoder_circuits
☆66 Updated 4 months ago
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆14 Updated 5 months ago
akjindal53244 / Arithmo
Small and Efficient Mathematical Reasoning LLMs
☆71 Updated last year
Cadenza-Labs / sleeper-agents
☆10 Updated 8 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆198 Updated 6 months ago
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆108 Updated 11 months ago
msclar / formatspread
Code accompanying "How I learned to start worrying about prompt formatting".
☆102 Updated 5 months ago
Mohammadjafari80 / GSM8K-RLVR
A simplified implementation for experimenting with Reinforcement Learning (RL) on GSM8K, inspired by RLVR and Deepseek R1. This repositor…
☆72 Updated last month
jayelm / gisting
Learning to Compress Prompts with Gist Tokens - https://arxiv.org/abs/2304.08467
☆281 Updated last month
ArthurConmy / Automatic-Circuit-Discovery
☆214 Updated 6 months ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆192 Updated 3 months ago
andyzoujm / breaking-llama-guard
Code to break Llama Guard
☆31 Updated last year
redwoodresearch / alignment_faking_public
☆43 Updated 2 months ago