UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆938 · Updated this week
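For context on what the framework looks like in practice, below is a minimal sketch of an Inspect eval based on its documented task API. The task name, sample text, and model id are illustrative placeholders, not taken from this listing.

```python
from inspect_ai import Task, eval, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def addition():
    # A one-sample dataset, a plain generation solver,
    # and an exact-match scorer against the target string.
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target="2")],
        solver=generate(),
        scorer=exact(),
    )

if __name__ == "__main__":
    # Placeholder model id; assumes provider credentials are configured.
    eval(addition(), model="openai/gpt-4o")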
Alternatives and similar repositories for inspect_ai
Users interested in inspect_ai are comparing it to the libraries listed below.
- Collection of evals for Inspect AI ☆132 · Updated this week
- A library for making RepE control vectors ☆589 · Updated 4 months ago
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆938 · Updated 3 weeks ago
- A benchmark to evaluate language models on questions I've previously asked them to solve. ☆1,010 · Updated 3 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,537 · Updated this week
- METR Task Standard ☆146 · Updated 3 months ago
- Automatically evaluate your LLMs in Google Colab ☆625 · Updated last year
- A library for generative social simulation ☆870 · Updated last week
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,153 · Updated last month
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,411 · Updated last week
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆92 · Updated this week
- Training Sparse Autoencoders on Language Models ☆770 · Updated this week
- A tool for evaluating LLMs ☆419 · Updated last year
- Guardrails for secure and robust agent development ☆252 · Updated this week
- Code and Data for Tau-Bench ☆485 · Updated 3 months ago
- System 2 Reasoning Link Collection ☆833 · Updated 2 months ago
- Prompt engineering, automated. ☆311 · Updated 3 weeks ago
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR t… ☆423 · Updated 3 months ago
- Utilities for decoding deep representations (like sentence embeddings) back to text ☆809 · Updated last month
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆643 · Updated 9 months ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆523 · Updated 10 months ago
- Verdict is a library for scaling judge-time compute. ☆211 · Updated 2 weeks ago
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆1,349 · Updated 4 months ago
- Verifiers for LLM Reinforcement Learning ☆953 · Updated this week
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆703 · Updated last week
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆563 · Updated this week
- Automated Evaluation of RAG Systems ☆590 · Updated last month
- Data-Driven Evaluation for LLM-Powered Applications ☆493 · Updated 3 months ago
- AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices. ☆477 · Updated this week
- Sparsify transformers with SAEs and transcoders ☆526 · Updated this week