UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆992 · Updated this week
Alternatives and similar repositories for inspect_ai
Users who are interested in inspect_ai are comparing it to the libraries listed below.
- Collection of evals for Inspect AI ☆144 · Updated this week
- A library for making RepE control vectors ☆595 · Updated 4 months ago
- A tool for evaluating LLMs ☆418 · Updated last year
- A benchmark to evaluate language models on questions I've previously asked them to solve. ☆1,014 · Updated last month
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,574 · Updated last week
- METR Task Standard ☆148 · Updated 4 months ago
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆950 · Updated last month
- End-to-end Generative Optimization for AI Agents ☆586 · Updated last week
- Utilities for decoding deep representations (like sentence embeddings) back to text ☆820 · Updated last week
- LLM Comparator is an interactive data visualization tool for evaluating and analyzing LLM responses side-by-side, developed by the PAIR t… ☆439 · Updated 3 months ago
- A library for mechanistic interpretability of GPT-style language models ☆2,217 · Updated last week
- Automatically evaluate your LLMs in Google Colab ☆631 · Updated last year
- Verifiers for LLM Reinforcement Learning ☆1,197 · Updated this week
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,438 · Updated last week
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆1,391 · Updated 5 months ago
- Sparsify transformers with SAEs and transcoders ☆553 · Updated this week
- Scale your LLM-as-a-judge. ☆234 · Updated last week
- Extract full next-token probabilities via language model APIs ☆248 · Updated last year
- ControlArena is a suite of realistic settings, mimicking complex deployment environments, for running control evaluations. This is an alp… ☆61 · Updated this week
- A library for prompt engineering and optimization (SAMMO = Structure-aware Multi-Objective Metaprompt Optimization) ☆678 · Updated 5 months ago
- ☆1,642 · Updated last week
- Weave is a toolkit for developing AI-powered applications, built by Weights & Biases. ☆894 · Updated this week
- Synthetic data curation for post-training and structured data extraction ☆1,372 · Updated this week
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆734 · Updated 2 weeks ago
- ☆567 · Updated last week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,396 · Updated last week
- Training Sparse Autoencoders on Language Models ☆802 · Updated last week
- LLM Analytics ☆664 · Updated 7 months ago
- List of papers on hallucination detection in LLMs. ☆882 · Updated last week
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆1,881 · Updated 9 months ago