moirage / alignment-research-dataset
A dataset of alignment research and code to reproduce it
☆78 Updated 2 years ago
Alternatives and similar repositories for alignment-research-dataset
Users who are interested in alignment-research-dataset are comparing it to the libraries listed below.
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆216 Updated this week
- Mechanistic Interpretability for Transformer Models ☆53 Updated 3 years ago
- ☆19 Updated 2 years ago
- One stop shop for all things carp ☆59 Updated 3 years ago
- Measuring the situational awareness of language models ☆39 Updated last year
- Drive a browser with Cohere ☆73 Updated 2 years ago
- Python library which enables complex compositions of language models such as scratchpads, chain of thought, tool use, selection-inference… ☆215 Updated 6 months ago
- Probabilistic LLM evaluations. [CogSci2023; ACL2023] ☆73 Updated last year
- Erasing concepts from neural representations with provable guarantees ☆240 Updated 10 months ago
- MiniHF is an inference, human preference data collection, and fine-tuning tool for local language models. It is intended to help the user… ☆183 Updated last month
- ☆29 Updated last year
- Emergent world representations: Exploring a sequence model trained on a synthetic task ☆197 Updated 2 years ago