CentreSecuriteIA / BELLS
Benchmarks for the Evaluation of LLM Supervision
Related projects
Alternatives and complementary repositories for BELLS
- This repository collects all relevant resources about interpretability in LLMs
- Inspect: A framework for large language model evaluations
- Tools for understanding how transformer predictions are built layer-by-layer
- 📚 A curated list of papers & technical articles on AI Quality & Safety
- Mechanistic Interpretability Visualizations using React
- METR Task Standard
- The nnsight package enables interpreting and manipulating the internals of deep learned models.
- Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
- Machine Learning for Alignment Bootcamp
- Tools for studying developmental interpretability in neural networks
- Sparse Autoencoder for Mechanistic Interpretability
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
- Aligning AI With Shared Human Values (ICLR 2021)
- Steering vectors for transformer language models in PyTorch / Hugging Face
- Training Sparse Autoencoders on Language Models
- 🧠 Starter templates for doing interpretability research
- Interpretability for sequence generation models 🐛 🔍
- Croissant is a high-level format for machine learning datasets that brings together four rich layers.
- Using sparse coding to find distributed representations used by neural networks
- Sparse autoencoders
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
- Erasing concepts from neural representations with provable guarantees
- Repository for research in the field of Responsible NLP at Meta