UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆783 Updated this week
Alternatives and similar repositories for inspect_ai:
Users who are interested in inspect_ai are comparing it to the libraries listed below:
- METR Task Standard ☆142 Updated 2 weeks ago
- Collection of evals for Inspect AI ☆77 Updated this week
- A library for making RepE control vectors ☆551 Updated last month
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,180 Updated this week
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆873 Updated last month
- Guide for fine-tuning Llama/Mistral/CodeLlama models and more ☆567 Updated 5 months ago
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,295 Updated last week
- A small library of LLM judges ☆143 Updated 2 weeks ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆493 Updated 7 months ago
- A library for prompt engineering and optimization (SAMMO = Structure-aware Multi-Objective Metaprompt Optimization) ☆637 Updated 2 months ago
- In-Context Learning for eXtreme Multi-Label Classification (XMC) using only a handful of examples. ☆406 Updated last year
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR t… ☆374 Updated last week
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆1,030 Updated last month
- A benchmark to evaluate language models on questions I've previously asked them to solve. ☆973 Updated 3 weeks ago
- Automatically evaluate your LLMs in Google Colab ☆592 Updated 9 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆78 Updated this week
- Synthetic Data curation for post-training and structured data extraction ☆816 Updated this week
- A tool for evaluating LLMs ☆402 Updated 9 months ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,237 Updated last week
- Extract full next-token probabilities via language model APIs ☆229 Updated 11 months ago
- End-to-end Generative Optimization for AI Agents ☆479 Updated this week
- System 2 Reasoning Link Collection ☆794 Updated 2 weeks ago
- A framework-less approach to robust agent development. ☆154 Updated this week
- ☆459 Updated this week
- ShellSage saves sysadmins’ sanity by solving shell script snafus super swiftly ☆288 Updated last week
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,019 Updated last month
- Fiddler Auditor is a tool to evaluate language models. ☆175 Updated 11 months ago
- Code and Data for Tau-Bench ☆273 Updated last month
- Weave is a toolkit for developing AI-powered applications, built by Weights & Biases. ☆822 Updated this week