sambowyer / bayes_evals

A lightweight library for Bayesian analysis of LLM evals (ICML 2025 Spotlight Position Paper)

☆18

Alternatives and similar repositories for bayes_evals

Users who are interested in bayes_evals are comparing it to the libraries listed below.

bhaweshiitk / ConformalLLM
Extending Conformal Prediction to LLMs
☆67 Updated last year
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆27 Updated last month
jonhue / activeft
PyTorch library for Active Fine-Tuning
☆87 Updated 5 months ago
Mayo-Radiology-Informatics-Lab / conflare
This is the repository for the CONFLARE (CONformal LArge language model REtrieval) Python package.
☆19 Updated last year
nikitadhawan / natural
☆43 Updated 8 months ago
Networks-Learning / counterfactual-llms
Code for "Counterfactual Token Generation in Large Language Models", arXiv 2024.
☆28 Updated 8 months ago
Bradley-Butcher / Conformers
Unofficial implementation of Conformal Language Modeling by Quach et al.
☆29 Updated 2 years ago
MadryLab / platinum-benchmarks
☆29 Updated 3 months ago
shreyansh26 / LLM-Sampling
A collection of various LLM sampling methods implemented in pure PyTorch
☆23 Updated 7 months ago
vdlad / Remarkable-Robustness-of-LLMs
Codebase for the paper "The Remarkable Robustness of LLMs: Stages of Inference?"
☆18 Updated last month
allenai / infinigram-api
☆69 Updated last month
kevinwu23 / StanfordFineTuneBench
☆30 Updated 8 months ago
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆60 Updated 10 months ago
krypticmouse / matryoshka-representation-learning
PyTorch implementation for MRL
☆19 Updated last year
yash-srivastava19 / arrakis
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
☆31 Updated 2 months ago
google-deepmind / mishax
☆134 Updated 3 months ago
interpretml / LLM-Tabular-Memorization-Checker
Testing Language Models for Memorization of Tabular Datasets.
☆34 Updated 5 months ago
allenai / hybrid-preferences
Learning to route instances for Human vs AI Feedback (ACL 2025 Main)
☆23 Updated 2 months ago
ahstat / episodic-memory-benchmark
Synthetic data generation and benchmark implementation for "Episodic Memories Generation and Evaluation Benchmark for Large Language Mode…
☆48 Updated 3 months ago
Pleias / Quest-Best-Tokens
An introduction to LLM Sampling
☆79 Updated 7 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆102 Updated 3 weeks ago
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆88 Updated 9 months ago
AnswerDotAI / fastkmeans
☆62 Updated last week
socialfoundations / folktexts
Evaluate uncertainty, calibration, accuracy, and fairness of LLMs on real-world survey data!
☆23 Updated 3 months ago
ZeroSumEval / ZeroSumEval
A framework for pitting LLMs against each other in an evolving library of games ⚔
☆32 Updated 3 months ago
thomasnormal / fewshot
☆28 Updated 3 weeks ago
vinid / NegotiationArena
☆72 Updated last year
jxmorris12 / cde
Code for training & evaluating Contextual Document Embedding models
☆194 Updated 2 months ago
ml-jku / SDLG
SDLG is an efficient method to accurately estimate aleatoric semantic uncertainty in LLMs
☆25 Updated last year
jjcherian / conditional-conformal
A package for conformal prediction with conditional guarantees.
☆61 Updated 4 months ago