UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆1,035 · Updated last week
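For context, here is a minimal sketch of what an Inspect eval looks like, using the framework's basic Task / solver / scorer API. The one-sample dataset and the model name are illustrative assumptions, not taken from this listing:

```python
# Minimal Inspect eval sketch: a one-sample task scored by substring match.
# The sample content and model name below are illustrative assumptions.
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def smoke_test():
    return Task(
        dataset=[Sample(input="What is 2 + 2?", target="4")],
        solver=generate(),   # one model completion per sample
        scorer=includes(),   # pass if the target string appears in the output
    )

if __name__ == "__main__":
    eval(smoke_test(), model="openai/gpt-4o-mini")
```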
Alternatives and similar repositories for inspect_ai
Users interested in inspect_ai are comparing it to the libraries listed below
- Collection of evals for Inspect AI (run sketch after this list) ☆167 · Updated this week
- DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤 ☆1,028 · Updated 4 months ago
- A benchmark to evaluate language models on questions I've previously asked them to solve. ☆1,018 · Updated 2 months ago
- Utilities for decoding deep representations (like sentence embeddings) back to text ☆827 · Updated last month
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆952 · Updated 2 months ago
- A library for making RepE control vectors ☆613 · Updated 5 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆760 · Updated last week
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆537 · Updated last year
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,641 · Updated this week
- Training Sparse Autoencoders on Language Models ☆846 · Updated this week
- METR Task Standard ☆151 · Updated 4 months ago
- A tool for evaluating LLMs ☆419 · Updated last year
- Weave is a toolkit for developing AI-powered applications, built by Weights & Biases. ☆908 · Updated this week
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆294 · Updated 9 months ago
- Data-Driven Evaluation for LLM-Powered Applications ☆500 · Updated 5 months ago
- Code and Data for Tau-Bench ☆624 · Updated 5 months ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆599 · Updated this week
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆1,444 · Updated 5 months ago
- Automatically evaluate your LLMs in Google Colab ☆643 · Updated last year
- [ICLR 2025] Automated Design of Agentic Systems ☆1,345 · Updated 5 months ago
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR t… ☆444 · Updated 4 months ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,426 · Updated this week
- AIDE: AI-Driven Exploration in the Space of Code. State-of-the-art machine learning engineering agent that automates AI R&D. ☆934 · Updated 2 months ago
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,459 · Updated last month
- Adding guardrails to large language models. ☆5,132 · Updated 3 weeks ago
- TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients. ☆2,692 · Updated 2 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆96 · Updated this week
- ControlArena is a suite of realistic settings, mimicking complex deployment environments, for running control evaluations. This is an alp… ☆69 · Updated this week
- In-Context Learning for eXtreme Multi-Label Classification (XMC) using only a handful of examples. ☆425 · Updated last year
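As referenced in the first item above, a hedged sketch of running one benchmark from the evals collection for Inspect AI. The gsm8k import path and the model name are assumptions based on that repository's documented usage pattern and may differ across versions:

```python
# Sketch: running a benchmark from the Inspect evals collection.
# Assumes `pip install inspect-ai inspect-evals`; the gsm8k import path
# and the model name are assumptions and may vary by version.
from inspect_ai import eval
from inspect_evals.gsm8k import gsm8k  # assumed task module

# Runs the GSM8K task against the chosen model and writes an eval log.
eval(gsm8k(), model="openai/gpt-4o-mini")
```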