Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,798May 20, 2026Updated this week
Alternatives and similar repositories for helm
Users that are interested in helm are comparing it to the libraries listed below.
- A framework for few-shot evaluation of language models.☆12,678May 11, 2026Updated 2 weeks ago
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models☆3,241Jul 19, 2024Updated last year
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting☆2,773Aug 4, 2024Updated last year
- Measuring Massive Multitask Language Understanding | ICLR 2021☆1,579May 28, 2023Updated 2 years ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.☆1,985Aug 9, 2025Updated 9 months ago
- Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.☆18,517Apr 14, 2026Updated last month
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)☆4,749Jan 8, 2024Updated 2 years ago
- Train transformer language models with reinforcement learning.☆18,411May 19, 2026Updated last week
- 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.☆21,187Updated this week
- Toolkit for creating, sharing and using natural language prompts.☆3,019Oct 23, 2023Updated 2 years ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"☆1,841Jun 17, 2025Updated 11 months ago
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, …☆7,025Updated this week
- Aligning pretrained language models with instruction data generated by themselves.☆4,600Mar 27, 2023Updated 3 years ago
- ☆4,492Apr 22, 2026Updated last month
- Robust recipes to align language models with human and AI preferences☆5,605Apr 8, 2026Updated last month
- General technology for enabling AI capabilities w/ LLMs and MLLMs☆4,394Updated this week
- The RedPajama-Data repository contains code for preparing large datasets for training large language models.☆4,942Dec 7, 2024Updated last year
- Ongoing research training transformer models at scale☆16,427Updated this week
- ☆1,564May 19, 2026Updated last week
- An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.☆39,484May 1, 2026Updated 3 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends☆2,426Updated this week
- Fast and memory-efficient exact attention☆23,917Updated this week
- Code and documentation to train Stanford's Alpaca models, and generate the data.☆30,248Jul 17, 2024Updated last year
- AllenAI's post-training codebase☆3,729Updated this week
- Accessible large language models via k-bit quantization for PyTorch.☆8,216Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs☆80,418May 19, 2026Updated last week
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.☆553Mar 10, 2024Updated 2 years ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)☆3,444Feb 8, 2026Updated 3 months ago
- Tools for merging pretrained large language models.☆7,100May 6, 2026Updated 3 weeks ago
- ☆772Jun 13, 2024Updated last year
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".☆1,600Apr 17, 2026Updated last month
- Minimalistic large language model 3D-parallelism training☆2,698Apr 7, 2026Updated last month
- The hub for EleutherAI's work on interpretability and learning dynamics☆2,806Nov 15, 2025Updated 6 months ago
- 800,000 step-level correctness labels on LLM solutions to MATH problems☆2,133Jun 1, 2023Updated 2 years ago
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities☆22,133Jan 23, 2026Updated 4 months ago
- A modular RL library to fine-tune language models to human preferences☆2,386Mar 1, 2024Updated 2 years ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.☆3,066May 6, 2026Updated 3 weeks ago
- Large Language Model Text Generation Inference☆10,856Mar 21, 2026Updated 2 months ago
- DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.☆42,386Updated this week