Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,826 · Jun 5, 2026 · Updated last week
Alternatives and similar repositories for helm
Users that are interested in helm are comparing it to the libraries listed below.
- A framework for few-shot evaluation of language models. ☆12,971 · Jun 2, 2026 · Updated 2 weeks ago
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,247 · Jul 19, 2024 · Updated last year
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,773 · Aug 4, 2024 · Updated last year
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,588 · May 28, 2023 · Updated 3 years ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,996 · Aug 9, 2025 · Updated 10 months ago
- Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. ☆18,688 · Apr 14, 2026 · Updated 2 months ago
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) ☆4,749 · Jan 8, 2024 · Updated 2 years ago
- Train transformer language models with reinforcement learning. ☆18,663 · Updated this week
- 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. ☆21,273 · Updated this week
- Toolkit for creating, sharing and using natural language prompts. ☆3,025 · Oct 23, 2023 · Updated 2 years ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,839 · Jun 17, 2025 · Updated last year
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, … ☆7,094 · Updated this week
- Aligning pretrained language models with instruction data generated by themselves. ☆4,600 · Mar 27, 2023 · Updated 3 years ago
- ☆4,521 · Apr 22, 2026 · Updated last month
- Robust recipes to align language models with human and AI preferences ☆5,613 · May 26, 2026 · Updated 3 weeks ago
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆4,417 · Updated this week
- The RedPajama-Data repository contains code for preparing large datasets for training large language models. ☆4,948 · Jun 3, 2026 · Updated 2 weeks ago
- ☆1,565 · Jun 10, 2026 · Updated last week
- Ongoing research training transformer models at scale ☆16,687 · Updated this week
- An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. ☆39,476 · May 1, 2026 · Updated last month
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,447 · Jun 9, 2026 · Updated last week
- Fast and memory-efficient exact attention ☆24,170 · Updated this week
- Code and documentation to train Stanford's Alpaca models, and generate the data. ☆30,246 · Jul 17, 2024 · Updated last year
- AllenAI's post-training codebase ☆3,751 · Updated this week
- Accessible large language models via k-bit quantization for PyTorch. ☆8,263 · Jun 11, 2026 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆83,135 · Updated this week
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ☆553 · Mar 10, 2024 · Updated 2 years ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,492 · Feb 8, 2026 · Updated 4 months ago
- Tools for merging pretrained large language models. ☆7,154 · Updated this week
- ☆775 · Jun 13, 2024 · Updated 2 years ago
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,600 · Apr 17, 2026 · Updated 2 months ago
- Minimalistic large language model 3D-parallelism training ☆2,715 · May 26, 2026 · Updated 3 weeks ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,819 · Nov 15, 2025 · Updated 7 months ago
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,143 · Jun 1, 2023 · Updated 3 years ago
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities ☆22,149 · Jan 23, 2026 · Updated 4 months ago
- A modular RL library to fine-tune language models to human preferences ☆2,388 · Mar 1, 2024 · Updated 2 years ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆3,091 · May 26, 2026 · Updated 3 weeks ago
- Large Language Model Text Generation Inference ☆10,863 · Mar 21, 2026 · Updated 2 months ago
- DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. ☆42,508 · Updated this week