Aleph-Alpha-Research / AtMan
☆30 · Updated 2 months ago
Alternatives and similar repositories for AtMan
Users interested in AtMan are comparing it to the libraries listed below:
- Contains random samples referenced in the paper "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training". ☆109 · Updated last year
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆217 · Updated last year
- ☆291 · Updated last year
- Fast & more realistic evaluation of chat language models. Includes leaderboard. ☆187 · Updated last year
- [NeurIPS 2023 D&B] Code repository for the InterCode benchmark, https://arxiv.org/abs/2306.14898 ☆222 · Updated last year
- Steering vectors for transformer language models in PyTorch / Hugging Face (see the sketch after this list) ☆115 · Updated 4 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆243 · Updated last month
- Tools for understanding how transformer predictions are built layer-by-layer ☆505 · Updated last year
- Code for the NeurIPS'24 paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" ☆224 · Updated this week
- ☆283 · Updated last year
- The official evaluation suite and dynamic data release for MixEval. ☆242 · Updated 8 months ago
- ☆268 · Updated 5 months ago
- Collection of evals for Inspect AI ☆178 · Updated last week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆206 · Updated 7 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 · Updated 4 months ago
- Ghostbuster: Detecting Text Ghostwritten by Large Language Models (NAACL 2024) ☆160 · Updated last year
- RuLES: a benchmark for evaluating rule-following in language models ☆227 · Updated 4 months ago
- Mechanistic Interpretability Visualizations using React ☆262 · Updated 7 months ago
- Code accompanying "How I learned to start worrying about prompt formatting". ☆106 · Updated last month
- LLM Workshop by Sourab Mangrulkar ☆387 · Updated last year
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆95 · Updated last year
- Evaluating LLMs with fewer examples ☆160 · Updated last year
- A toolkit for describing model features and intervening on those features to steer behavior. ☆193 · Updated 8 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆306 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆220 · Updated 9 months ago
- Functional Benchmarks and the Reasoning Gap ☆88 · Updated 9 months ago
- Run evaluation on LLMs using the HumanEval benchmark ☆415 · Updated last year
- Learning to Compress Prompts with Gist Tokens - https://arxiv.org/abs/2304.08467 ☆289 · Updated 5 months ago
- Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning ☆46 · Updated last year
- ☆297 · Updated last year
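Several of the repositories above (the steering-vectors library, the SERI MATS activation-steering experiments, the refusal-direction code, and the feature-steering toolkit) revolve around the same basic operation: adding a direction to a model's residual stream at inference time. The sketch below is a generic illustration of that idea using plain Hugging Face transformers and PyTorch forward hooks; it is not the API of any listed repository, and the model name (gpt2), layer index, contrastive prompts, and scale are arbitrary choices for demonstration.

```python
# Minimal sketch of contrastive activation steering (generic illustration,
# not the API of any repository listed above). Assumptions: GPT-2 small,
# an arbitrary middle layer, and a hand-picked steering scale.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # assumption: any HF causal LM with .transformer.h blocks
layer_idx = 6         # assumption: arbitrary middle layer
scale = 4.0           # assumption: steering strength

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def residual_at(prompt: str) -> torch.Tensor:
    """Capture the residual-stream activation at the last token of layer_idx."""
    captured = {}
    def hook(_module, _inputs, output):
        captured["h"] = output[0]  # GPT-2 blocks return a tuple; [0] is hidden states
    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"][0, -1, :]  # shape: (hidden_dim,)

# Contrastive pair: the steering vector is the difference of the two activations.
steer = residual_at("I love this movie.") - residual_at("I hate this movie.")

def steering_hook(_module, _inputs, output):
    hidden = output[0] + scale * steer      # add the direction at every position
    return (hidden,) + output[1:]           # keep the rest of the block output intact

handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)
with torch.no_grad():
    out = model.generate(**tok("The film was", return_tensors="pt"), max_new_tokens=20)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

The dedicated libraries above differ mainly in how the vector is derived (many contrastive pairs averaged, sparse-autoencoder features, or a learned refusal direction) and in where it is applied, but the hook-based injection shown here is the common core.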