Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for the holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆ 2,741, updated Apr 10, 2026
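"Holistic" here means that each scenario is scored along several axes at once (accuracy, calibration, robustness, and so on) rather than a single metric. As a rough illustration of that multi-metric idea only — this is not HELM's actual API, and the metric implementations are deliberately simplified — a standalone sketch might score a batch of predictions on both exact-match accuracy and expected calibration error:

```python
# Minimal sketch of multi-metric ("holistic") evaluation.
# Illustration only, NOT the HELM API: the metric names and the
# expected-calibration-error binning are simplified assumptions.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def expected_calibration_error(confidences, correctness, n_bins=10):
    """Confidence-vs-accuracy gap, averaged over equal-width bins."""
    ece = 0.0
    n = len(confidences)
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Put confidence 1.0 in the top bin instead of dropping it.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        bin_acc = sum(correctness[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - bin_acc)
    return ece

# Toy batch: predictions, gold references, and model confidences.
preds = ["Paris", "Berlin", "Madrid", "Rome"]
refs = ["Paris", "Berlin", "Lisbon", "Rome"]
confs = [0.9, 0.8, 0.7, 0.95]

correctness = [float(p == r) for p, r in zip(preds, refs)]
print({"exact_match": exact_match_accuracy(preds, refs),
       "ece": expected_calibration_error(confs, correctness)})
```

HELM reports many more metrics per scenario (efficiency, fairness, toxicity, etc.), but the pattern is the same: one set of model outputs, several independent scoring functions.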
Alternatives and similar repositories for helm
Users who are interested in helm are comparing it to the libraries listed below.
- A framework for few-shot evaluation of language models. (☆ 12,138, updated Apr 8, 2026)
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models (☆ 3,225, updated Jul 19, 2024)
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting (☆ 2,770, updated Aug 4, 2024)
- Measuring Massive Multitask Language Understanding | ICLR 2021 (☆ 1,569, updated May 28, 2023)
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. (☆ 1,966, updated Aug 9, 2025)
- Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. (☆ 18,169, updated Apr 6, 2026)
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) (☆ 4,743, updated Jan 8, 2024)
- Train transformer language models with reinforcement learning. (☆ 18,054, updated this week)
- 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. (☆ 20,929, updated this week)
- Toolkit for creating, sharing and using natural language prompts. (☆ 3,007, updated Oct 23, 2023)
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (☆ 1,837, updated Jun 17, 2025)
- OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, …) (☆ 6,866, updated this week)
- Aligning pretrained language models with instruction data generated by themselves. (☆ 4,586, updated Mar 27, 2023)
- (☆ 4,436, updated Jul 31, 2025)
- Robust recipes to align language models with human and AI preferences (☆ 5,558, updated Apr 8, 2026)
- General technology for enabling AI capabilities with LLMs and MLLMs (☆ 4,340, updated this week)
- The RedPajama-Data repository contains code for preparing large datasets for training large language models. (☆ 4,935, updated Dec 7, 2024)
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends (☆ 2,374, updated Apr 7, 2026)
- Ongoing research training transformer models at scale (☆ 15,985, updated this week)
- (☆ 1,561, updated Apr 8, 2026)
- Fast and memory-efficient exact attention (☆ 23,344, updated this week)
- An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. (☆ 39,448, updated Jun 2, 2025)
- Code and documentation to train Stanford's Alpaca models, and generate the data. (☆ 30,253, updated Jul 17, 2024)
- AllenAI's post-training codebase (☆ 3,683, updated this week)
- Accessible large language models via k-bit quantization for PyTorch. (☆ 8,107, updated Apr 8, 2026)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆ 76,536, updated this week)
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. (☆ 552, updated Mar 10, 2024)
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR '24) (☆ 3,316, updated Feb 8, 2026)
- Tools for merging pretrained large language models. (☆ 6,973, updated Mar 15, 2026)
- (☆ 772, updated Jun 13, 2024)
- Minimalistic large language model 3D-parallelism training (☆ 2,644, updated Apr 7, 2026)
- The hub for EleutherAI's work on interpretability and learning dynamics (☆ 2,768, updated Nov 15, 2025)
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". (☆ 1,595, updated Jun 3, 2025)
- 800,000 step-level correctness labels on LLM solutions to MATH problems (☆ 2,115, updated Jun 1, 2023)
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities (☆ 22,086, updated Jan 23, 2026)
- A modular RL library to fine-tune language models to human preferences (☆ 2,384, updated Mar 1, 2024)
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. (☆ 2,983, updated this week)
- Large Language Model Text Generation Inference (☆ 10,830, updated Mar 21, 2026)
- DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. (☆ 42,029, updated this week)