mlcommons / modelgaugeLinks

Make it easy to automatically and uniformly measure the behavior of many AI Systems.

☆26

Alternatives and similar repositories for modelgauge

Users that are interested in modelgauge are comparing it to the libraries listed below

Sorting:

mlcommons / modelbench
Run safety benchmarks against AI models and view detailed reports showing how well they performed.
☆108Updated this week
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆239Updated 9 months ago
taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆64Updated last year
amazon-science / llm-interpret
Code for the ACL 2023 paper: "Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Sc…
☆34Updated 2 years ago
neelsjain / BYOD
The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models"
☆107Updated 2 years ago
yidingjiang / ado
The repository contains code for Adaptive Data Optimization
☆27Updated 10 months ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28Updated last year
google-deepmind / mishax
☆142Updated last month
stanford-crfm / EUAIActJune15
Stanford CRFM's initiative to assess potential compliance with the draft EU AI Act
☆93Updated 2 years ago
hadasah / btm
☆76Updated last year
mcleish7 / gemstone-scaling-laws
Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025)
☆29Updated last month
r-three / git-theta
git extension for {collaborative, communal, continual} model development
☆215Updated 11 months ago
microsoft / mechanistic-error-probe
A mechanistic approach for understanding and detecting factual errors of large language models.
☆46Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71Updated last year
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆89Updated last year
guy-dar / embedding-space
☆55Updated 2 years ago
aryamanarora / causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
☆48Updated 11 months ago
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆80Updated 2 years ago
ethz-spylab / superhuman-ai-consistency
☆31Updated 2 years ago
EleutherAI / semantic-memorization
☆44Updated 11 months ago
tanyuqian / redco
NAACL '24 (Best Demo Paper RunnerUp) / MlSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference
☆68Updated 10 months ago
fbarez / neuroplasticity
☆14Updated last year
srush / LLM-Talk
☆52Updated last year
EleutherAI / features-across-time
Understanding how features learned by neural networks evolve throughout training
☆39Updated last year
llm-efficiency-challenge / neurips_llm_efficiency_challenge
NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day
☆256Updated 2 years ago
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆164Updated last year
google-deepmind / asyncdiloco
☆46Updated last year
google-deepmind / dangerous-capability-evaluations
☆61Updated last month
microsoft / deep-language-networks
We view Large Language Models as stochastic language layers in a network, where the learnable parameters are the natural language prompts…
☆94Updated last year
google-research / cascades
Python library which enables complex compositions of language models such as scratchpads, chain of thought, tool use, selection-inference…
☆215Updated 5 months ago