stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. A minimal command-line sketch of a typical evaluation run is shown below.
☆2,667 · Updated this week
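HELM is driven from the command line. The following is a rough sketch of a typical workflow, assuming the `crfm-helm` PyPI package and its `helm-run`/`helm-summarize`/`helm-server` entry points; exact flag names and run-entry syntax vary between releases, so check the project's documentation.

```
# Install the framework (assumption: the PyPI package is named crfm-helm)
pip install crfm-helm

# Run a small benchmark slice; --max-eval-instances keeps the run cheap
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 \
    --suite my-suite --max-eval-instances 10

# Aggregate the raw results into summary tables, then browse them locally
helm-summarize --suite my-suite
helm-server
```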
Alternatives and similar repositories for helm
Users interested in helm are comparing it to the libraries listed below.
- A framework for few-shot evaluation of language models. ☆11,393 · Updated this week
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,199 · Updated Jul 19, 2024
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,768 · Updated Aug 4, 2024
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,552 · Updated May 28, 2023
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) ☆4,742 · Updated Jan 8, 2024
- Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. ☆17,663 · Updated Nov 3, 2025
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,946 · Updated Aug 9, 2025
- Train transformer language models with reinforcement learning. ☆17,360 · Updated this week
- 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. ☆20,619 · Updated this week
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,814 · Updated Jun 17, 2025
- General technology for enabling AI capabilities with LLMs and MLLMs ☆4,284 · Updated Dec 22, 2025
- ☆4,346 · Updated Jul 31, 2025
- Toolkit for creating, sharing and using natural language prompts. ☆2,997 · Updated Oct 23, 2023
- Robust recipes to align language models with human and AI preferences ☆5,495 · Updated Sep 8, 2025
- Aligning pretrained language models with instruction data generated by themselves. ☆4,573 · Updated Mar 27, 2023
- Ongoing research training transformer models at scale ☆15,162 · Updated this week
- ☆1,559 · Updated Feb 5, 2026
- OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, …). ☆6,663 · Updated this week
- The RedPajama-Data repository contains code for preparing large datasets for training large language models. ☆4,924 · Updated Dec 7, 2024
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,151 · Updated Nov 17, 2025
- Minimalistic large language model 3D-parallelism training ☆2,544 · Updated Dec 11, 2025
- Fast and memory-efficient exact attention ☆22,231 · Updated this week
- Accessible large language models via k-bit quantization for PyTorch. ☆7,952 · Updated this week
- AllenAI's post-training codebase ☆3,573 · Updated this week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,293 · Updated Jan 21, 2026
- Code and documentation to train Stanford's Alpaca models, and generate the data. ☆30,266 · Updated Jul 17, 2024
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,092 · Updated Jun 1, 2023
- Tools for merging pretrained large language models. ☆6,783 · Updated Jan 26, 2026
- An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. ☆39,402 · Updated Jun 2, 2025
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,731 · Updated Nov 15, 2025
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆70,205 · Updated this week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,885 · Updated this week
- Large Language Model Text Generation Inference ☆10,757 · Updated Jan 8, 2026
- ☆772 · Updated Jun 13, 2024
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities ☆22,021 · Updated Jan 23, 2026
- A modular RL library to fine-tune language models to human preferences ☆2,377 · Updated Mar 1, 2024
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ☆551 · Updated Mar 10, 2024
- DSPy: The framework for programming—not prompting—language models ☆32,156 · Updated this week
- Modeling, training, eval, and inference code for OLMo ☆6,306 · Updated Nov 24, 2025