ethz-spylab / superhuman-ai-consistency

☆30

Alternatives and similar repositories for superhuman-ai-consistency

Users who are interested in superhuman-ai-consistency are comparing it to the repositories listed below

mcleish7 / gemstone-scaling-laws
Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025)
☆29 Updated last month
ctlllll / reward_collapse
☆27 Updated 2 years ago
formll / resolving-scaling-law-discrepancies
☆20 Updated 2 weeks ago
yidingjiang / ado
The repository contains code for Adaptive Data Optimization
☆28 Updated 11 months ago
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆31 Updated 9 months ago
jiahai-feng / binding-iclr
☆16 Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
matchten / LoRA-Models-for-SAEs
Code for reproducing our paper "Low Rank Adapting Models for Sparse Autoencoder Features"
☆17 Updated 7 months ago
cassidylaidlaw / orpo
☆19 Updated last year
azshue / AutoPoison
The official repository of the paper "On the Exploitability of Instruction Tuning".
☆65 Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
janphilippfranken / sami
Self-Supervised Alignment with Mutual Information
☆21 Updated last year
katiekang1998 / reasoning_generalization
☆33 Updated 10 months ago
ahans30 / goldfish-loss
[NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs
☆92 Updated last year
MadryLab / datamodels-data
Data for "Datamodels: Predicting Predictions with Training Data"
☆97 Updated 2 years ago
MadryLab / modeldiff
ModelDiff: A Framework for Comparing Learning Algorithms
☆59 Updated 2 years ago
thestephencasper / explore_establish_exploit_llms
☆31 Updated 2 years ago
arobey1 / advbench
☆44 Updated 2 years ago
XiangLi1999 / AutoBencher
☆32 Updated last year
lingo-mit / lm-truthfulness
☆17 Updated last year
locuslab / acr-memorization
☆37 Updated 11 months ago
taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆64 Updated last year
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆20 Updated 3 weeks ago
milesaturpin / cot-unfaithfulness
☆51 Updated 2 years ago
ejones313 / auditing-llms
☆59 Updated 2 years ago
abhishekpanigrahi1996 / transformer_in_transformer
☆45 Updated 2 years ago
EleutherAI / mdl
Minimum Description Length probing for neural network representations
☆20 Updated 9 months ago
max-andr / adversarial-random-search-gpt4
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43 Updated last year
MadryLab / DsDm
☆51 Updated last year
GXimingLu / IPA
Codebase for Inference-Time Policy Adapters
☆24 Updated 2 years ago