Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,718 · Mar 20, 2026 · Updated last week
Alternatives and similar repositories for helm
Users who are interested in helm are comparing it to the libraries listed below.
- A framework for few-shot evaluation of language models. ☆11,802 · Mar 18, 2026 · Updated last week
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,219 · Jul 19, 2024 · Updated last year
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,770 · Aug 4, 2024 · Updated last year
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,569 · May 28, 2023 · Updated 2 years ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,961 · Aug 9, 2025 · Updated 7 months ago
- Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks. ☆18,046 · Nov 3, 2025 · Updated 4 months ago
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) ☆4,742 · Jan 8, 2024 · Updated 2 years ago
- 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. ☆20,841 · Mar 18, 2026 · Updated last week
- Train transformer language models with reinforcement learning. ☆17,781 · Updated this week
- Toolkit for creating, sharing and using natural language prompts. ☆3,007 · Oct 23, 2023 · Updated 2 years ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,832 · Jun 17, 2025 · Updated 9 months ago
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, … ☆6,788 · Updated this week
- Aligning pretrained language models with instruction data generated by themselves. ☆4,587 · Mar 27, 2023 · Updated 2 years ago
- ☆4,406 · Jul 31, 2025 · Updated 7 months ago
- Robust recipes to align language models with human and AI preferences ☆5,535 · Sep 8, 2025 · Updated 6 months ago
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆4,310 · Updated this week
- The RedPajama-Data repository contains code for preparing large datasets for training large language models. ☆4,930 · Dec 7, 2024 · Updated last year
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,353 · Mar 9, 2026 · Updated 2 weeks ago
- ☆1,559 · Updated this week
- Ongoing research training transformer models at scale ☆15,744 · Mar 20, 2026 · Updated last week
- An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena. ☆39,445 · Jun 2, 2025 · Updated 9 months ago
- Fast and memory-efficient exact attention ☆22,938 · Updated this week
- Code and documentation to train Stanford's Alpaca models, and generate the data. ☆30,256 · Jul 17, 2024 · Updated last year
- AllenAI's post-training codebase ☆3,643 · Updated this week
- Accessible large language models via k-bit quantization for PyTorch. ☆8,078 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆74,135 · Updated this week
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ☆552 · Mar 10, 2024 · Updated 2 years ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,253 · Feb 8, 2026 · Updated last month
- Tools for merging pretrained large language models. ☆6,895 · Mar 15, 2026 · Updated last week
- Minimalistic large language model 3D-parallelism training ☆2,617 · Feb 19, 2026 · Updated last month
- ☆771 · Jun 13, 2024 · Updated last year
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,751 · Nov 15, 2025 · Updated 4 months ago
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,591 · Jun 3, 2025 · Updated 9 months ago
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,106 · Jun 1, 2023 · Updated 2 years ago
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities ☆22,059 · Jan 23, 2026 · Updated 2 months ago
- A modular RL library to fine-tune language models to human preferences ☆2,383 · Mar 1, 2024 · Updated 2 years ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,965 · Mar 16, 2026 · Updated last week
- Large Language Model Text Generation Inference ☆10,812 · Jan 8, 2026 · Updated 2 months ago
- DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. ☆41,869 · Mar 18, 2026 · Updated last week