anthropics / evals
☆253 Updated 7 months ago
Alternatives and similar repositories for evals:
Users who are interested in evals compare it to the libraries listed below:
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆195 Updated this week
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆71 Updated last year
- ☆128 Updated 3 months ago
- ☆228 Updated 2 years ago
- Mechanistic Interpretability Visualizations using React ☆227 Updated last month
- METR Task Standard ☆142 Updated last week
- Python library which enables complex compositions of language models such as scratchpads, chain of thought, tool use, selection-inference… ☆202 Updated last month
- ☆262 Updated 11 months ago
- Mass-editing thousands of facts into a transformer memory (ICLR 2023) ☆462 Updated last year
- Extract full next-token probabilities via language model APIs ☆228 Updated 11 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆182 Updated last month
- ☆203 Updated 4 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆87 Updated last year
- Learning to Compress Prompts with Gist Tokens - https://arxiv.org/abs/2304.08467 ☆274 Updated this week
- Erasing concepts from neural representations with provable guarantees ☆222 Updated 2 weeks ago
- ☆177 Updated last year
- ☆25 Updated 10 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆87 Updated 2 months ago
- RuLES: a benchmark for evaluating rule-following in language models ☆217 Updated this week
- Draw more samples ☆186 Updated 7 months ago
- Code and data for "Measuring and Narrowing the Compositionality Gap in Language Models" ☆309 Updated last year
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆205 Updated 9 months ago
- A dataset of alignment research and code to reproduce it ☆73 Updated last year
- Inspecting and Editing Knowledge Representations in Language Models ☆112 Updated last year
- ☆61 Updated 2 weeks ago
- ☆160 Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆122 Updated 8 months ago
- ☆116 Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆181 Updated 4 months ago
- Used for adaptive human in the loop evaluation of language and embedding models. ☆306 Updated last year