Aleph-Alpha-Research / AtMan

☆30

Alternatives and similar repositories for AtMan

Users who are interested in AtMan are comparing it to the libraries listed below.

anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆109 Updated last year
kaistAI / FLASK
[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
☆217 Updated last year
lukasberglund / reversal_curse
☆291 Updated last year
FastEval / FastEval
Fast & more realistic evaluation of chat language models. Includes leaderboard.
☆187 Updated last year
princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆222 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆115 Updated 4 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆243 Updated last month
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆505 Updated last year
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆224 Updated this week
anthropics / evals
☆283 Updated last year
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆242 Updated 8 months ago
stanford-crfm / ecosystem-graphs
☆268 Updated 5 months ago
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆178 Updated last week
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆206 Updated 7 months ago
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆173 Updated 4 months ago
vivek3141 / ghostbuster
Ghostbuster: Detecting Text Ghostwritten by Large Language Models (NAACL 2024)
☆160 Updated last year
normster / llm_rules
RuLES: a benchmark for evaluating rule-following in language models
☆227 Updated 4 months ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆262 Updated 7 months ago
msclar / formatspread
Code accompanying "How I learned to start worrying about prompt formatting".
☆106 Updated last month
pacman100 / LLM-Workshop
LLM Workshop by Sourab Mangrulkar
☆387 Updated last year
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆95 Updated last year
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160 Updated last year
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆193 Updated 8 months ago
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆306 Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆220 Updated 9 months ago
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆88 Updated 9 months ago
abacaj / code-eval
Run evaluation on LLMs using human-eval benchmark
☆415 Updated last year
jayelm / gisting
Learning to Compress Prompts with Gist Tokens - https://arxiv.org/abs/2304.08467
☆289 Updated 5 months ago
jongjyh / TrFr
Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning
☆46 Updated last year
snap-stanford / MLAgentBench
☆297 Updated last year