CentreSecuriteIA / BELLS
Benchmarks for the Evaluation of LLM Supervision
☆26 · Updated 3 weeks ago
Related projects:
- Aligning AI With Shared Human Values (ICLR 2021) ☆233 · Updated last year
- Croissant is a high-level format for machine learning datasets that brings together four rich layers. ☆406 · Updated last week
- 📚 A curated list of papers & technical articles on AI Quality & Safety ☆155 · Updated 11 months ago
- METR Task Standard ☆111 · Updated last week
- This repository collects all relevant resources about interpretability in LLMs ☆230 · Updated last week
- Mechanistic Interpretability Visualizations using React ☆175 · Updated 2 months ago
- Interpretability for sequence generation models 🐛 🔍 ☆361 · Updated 3 weeks ago
- Inspect: A framework for large language model evaluations ☆546 · Updated this week
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL. ☆185 · Updated 7 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆408 · Updated 3 months ago
- Training Sparse Autoencoders on Language Models ☆367 · Updated this week
- Stanford CRFM's initiative to assess potential compliance with the draft EU AI Act ☆92 · Updated 11 months ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆356 · Updated this week
- ☆174 · Updated 4 months ago
- ☆256 · Updated this week
- utilities for decoding deep representations (like sentence embeddings) back to text ☆693 · Updated last week
- The Foundation Model Transparency Index ☆65 · Updated 3 months ago
- ☆268 · Updated last month
- ☆274 · Updated this week
- Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions ☆601 · Updated 2 weeks ago
- This repo contains the code for generating the ToxiGen dataset, published at ACL 2022. ☆270 · Updated 3 months ago
- ☆184 · Updated last week
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆250 · Updated this week
- Find and fix bugs in natural language machine learning models using adaptive testing. ☆181 · Updated 4 months ago
- ☆110 · Updated 3 weeks ago
- Machine Learning for Alignment Bootcamp ☆25 · Updated 6 months ago
- Differentially-private transformers using HuggingFace and Opacus ☆108 · Updated 3 weeks ago
- Erasing concepts from neural representations with provable guarantees ☆202 · Updated 3 months ago
- ☆40 · Updated this week
- Training data extraction on GPT-2 ☆166 · Updated last year