Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,775 · May 1, 2026 · Updated this week
Alternatives and similar repositories for helm
Users who are interested in helm compare it to the libraries listed below.
- A framework for few-shot evaluation of language models. ☆12,411 · Updated this week
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,234 · Jul 19, 2024 · Updated last year
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,771 · Aug 4, 2024 · Updated last year
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,580 · May 28, 2023 · Updated 2 years ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,982 · Aug 9, 2025 · Updated 8 months ago
- Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. ☆18,330 · Apr 14, 2026 · Updated 3 weeks ago
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) ☆4,745 · Jan 8, 2024 · Updated 2 years ago
- Train transformer language models with reinforcement learning. ☆18,282 · Updated this week
- 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. ☆21,052 · Updated this week
- Toolkit for creating, sharing and using natural language prompts. ☆3,010 · Oct 23, 2023 · Updated 2 years ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,840 · Jun 17, 2025 · Updated 10 months ago
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, … ☆6,959 · Apr 20, 2026 · Updated 2 weeks ago
- Aligning pretrained language models with instruction data generated by themselves. ☆4,595 · Mar 27, 2023 · Updated 3 years ago
- ☆4,471 · Apr 22, 2026 · Updated 2 weeks ago
- Robust recipes to align language models with human and AI preferences ☆5,593 · Apr 8, 2026 · Updated 3 weeks ago
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆4,368 · Updated this week
- The RedPajama-Data repository contains code for preparing large datasets for training large language models. ☆4,941 · Dec 7, 2024 · Updated last year
- ☆1,562 · Apr 18, 2026 · Updated 2 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,396 · Apr 17, 2026 · Updated 2 weeks ago
- Ongoing research training transformer models at scale ☆16,203 · Updated this week
- An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. ☆39,463 · Updated this week
- Fast and memory-efficient exact attention ☆23,628 · Updated this week
- Code and documentation to train Stanford's Alpaca models, and generate the data. ☆30,258 · Jul 17, 2024 · Updated last year
- AllenAI's post-training codebase ☆3,708 · Updated this week
- Accessible large language models via k-bit quantization for PyTorch. ☆8,178 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆78,979 · Updated this week
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ☆553 · Mar 10, 2024 · Updated 2 years ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,377 · Feb 8, 2026 · Updated 2 months ago
- Tools for merging pretrained large language models. ☆7,052 · Mar 15, 2026 · Updated last month
- ☆772 · Jun 13, 2024 · Updated last year
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,602 · Apr 17, 2026 · Updated 2 weeks ago
- Minimalistic large language model 3D-parallelism training ☆2,674 · Apr 7, 2026 · Updated 3 weeks ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,789 · Nov 15, 2025 · Updated 5 months ago
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,126 · Jun 1, 2023 · Updated 2 years ago
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities ☆22,114 · Jan 23, 2026 · Updated 3 months ago
- A modular RL library to fine-tune language models to human preferences ☆2,387 · Mar 1, 2024 · Updated 2 years ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆3,033 · Apr 20, 2026 · Updated 2 weeks ago
- Large Language Model Text Generation Inference ☆10,848 · Mar 21, 2026 · Updated last month
- DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. ☆42,231 · Updated this week