stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,335 · Updated this week
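In practice, HELM evaluations are driven by command-line entry points installed with the `crfm-helm` package. The sketch below is an illustrative assumption rather than code from the repository: it drives a small run from Python by shelling out to `helm-run` and `helm-summarize`; the run entry (`mmlu:subject=philosophy,model=openai/gpt2`), the suite name, and the exact flags are placeholders and vary between HELM versions.

```python
# Illustrative sketch only: drives a small HELM run from Python by calling the
# CLI entry points installed by `pip install crfm-helm`. The run entry, model
# name, suite name, and flags below are assumptions and differ across versions.
import subprocess

SUITE = "my-suite"  # hypothetical suite name used to group run outputs

# Evaluate one scenario/model pair on a handful of instances.
subprocess.run(
    [
        "helm-run",
        "--run-entries", "mmlu:subject=philosophy,model=openai/gpt2",
        "--suite", SUITE,
        "--max-eval-instances", "10",
    ],
    check=True,
)

# Aggregate the raw per-run outputs into summary tables for the web frontend.
subprocess.run(["helm-summarize", "--suite", SUITE], check=True)
```

In most versions the aggregated results can then be browsed locally via a `helm-server` command.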
Alternatives and similar repositories for helm
Users that are interested in helm are comparing it to the libraries listed below
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,800 · Updated 6 months ago
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,741 · Updated 11 months ago
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆4,055 · Updated 2 weeks ago
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆1,934 · Updated 11 months ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,762 · Updated last month
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,452 · Updated 2 years ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,565 · Updated last month
- ☆1,529 · Updated last week
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,079 · Updated 11 months ago
- AllenAI's post-training codebase ☆3,061 · Updated this week
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,264 · Updated last week
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,557 · Updated last year
- Reference implementation for DPO (Direct Preference Optimization) ☆2,638 · Updated 11 months ago
- Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models" ☆1,185 · Updated last year
- Toolkit for creating, sharing and using natural language prompts. ☆2,898 · Updated last year
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,722 · Updated last week
- [ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the dive… ☆950 · Updated 8 months ago
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆761 · Updated 6 months ago
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,542 · Updated last month
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆868 · Updated 2 weeks ago
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,021 · Updated 2 years ago
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. ☆776 · Updated last year
- YaRN: Efficient Context Window Extension of Large Language Models ☆1,518 · Updated last year
- A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca) ☆1,125 · Updated last year
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆2,676 · Updated 5 months ago
- A collection of awesome-prompt-datasets and awesome-instruction-dataset resources for training ChatLLMs such as ChatGPT; collects a wide variety of instruction datasets for training ChatLLM models. ☆685 · Updated last year
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,473 · Updated this week
- Implementation of the training framework proposed in Self-Rewarding Language Model, from MetaAI ☆1,394 · Updated last year
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets (a minimal usage sketch follows this list). ☆2,259 · Updated last week
- Expanding natural instructions ☆1,007 · Updated last year
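For contrast with HELM's scenario-driven runs, the 🤗 Evaluate library listed above exposes individual metrics as loadable objects. The snippet below is a minimal sketch assuming `pip install evaluate` (plus `scikit-learn`, which backs the accuracy metric); the predictions and references are toy values made up for illustration.

```python
# Minimal sketch of the 🤗 Evaluate metric API (assumes `pip install evaluate`
# and scikit-learn). The prediction/reference values are toy data.
import evaluate

accuracy = evaluate.load("accuracy")   # fetch a ready-made metric by name
result = accuracy.compute(
    predictions=[0, 1, 1, 0],          # model outputs (toy example)
    references=[0, 1, 0, 0],           # gold labels (toy example)
)
print(result)                          # {'accuracy': 0.75}
```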