UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆938 · Updated this week
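For context on what the framework looks like in practice, below is a minimal sketch of an Inspect eval based on its documented task API. The task name, sample text, and model id are illustrative placeholders, not taken from this listing.

```python
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def addition():
    # A one-sample dataset, a plain generation solver,
    # and an exact-match scorer against the target string.
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target="2")],
        solver=generate(),
        scorer=exact(),
    )

if __name__ == "__main__":
    # Placeholder model id; assumes provider credentials are configured.
    eval(addition(), model="openai/gpt-4o")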
Alternatives and similar repositories for inspect_ai
Users interested in inspect_ai are comparing it to the libraries listed below.
- Collection of evals for Inspect AI ☆132 · Updated this week
- A library for making RepE control vectors ☆589 · Updated 4 months ago
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆938 · Updated 3 weeks ago
- A benchmark to evaluate language models on questions I've previously asked them to solve. ☆1,010 · Updated 3 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,537 · Updated this week
- METR Task Standard ☆146 · Updated 3 months ago
- Automatically evaluate your LLMs in Google Colab ☆625 · Updated last year
- A library for generative social simulation ☆870 · Updated last week
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,153 · Updated last month
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,411 · Updated last week
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆92 · Updated this week
- Training Sparse Autoencoders on Language Models ☆770 · Updated this week
- A tool for evaluating LLMs ☆419 · Updated last year
- Guardrails for secure and robust agent development ☆252 · Updated this week
- Code and Data for Tau-Bench ☆485 · Updated 3 months ago
- System 2 Reasoning Link Collection ☆833 · Updated 2 months ago
- Prompt engineering, automated. ☆311 · Updated 3 weeks ago
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR t… ☆423 · Updated 3 months ago
- Utilities for decoding deep representations (like sentence embeddings) back to text ☆809 · Updated last month
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆643 · Updated 9 months ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆523 · Updated 10 months ago
- Verdict is a library for scaling judge-time compute. ☆211 · Updated 2 weeks ago
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆1,349 · Updated 4 months ago
- Verifiers for LLM Reinforcement Learning ☆953 · Updated this week
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆703 · Updated last week
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆563 · Updated this week
- Automated Evaluation of RAG Systems ☆590 · Updated last month
- Data-Driven Evaluation for LLM-Powered Applications ☆493 · Updated 3 months ago
- AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices. ☆477 · Updated this week
- Sparsify transformers with SAEs and transcoders ☆526 · Updated this week