stanford-crfm / helm

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.

☆2,522

Alternatives and similar repositories for helm

Users who are interested in helm compare it to the libraries listed below.


tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆1,891 Updated 2 months ago
hendrycks / test
Measuring Massive Multitask Language Understanding | ICLR 2021
☆1,508 Updated 2 years ago
microsoft / LMOps
General technology for enabling AI capabilities w/ LLMs and MLLMs
☆4,160 Updated 4 months ago
google-research / FLAN
☆1,549 Updated 2 months ago
anthropics / hh-rlhf
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
☆1,791 Updated 4 months ago
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,044 Updated this week
google / BIG-bench
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
☆3,137 Updated last year
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,338 Updated last month
EleutherAI / pythia
The hub for EleutherAI's work on interpretability and learning dynamics
☆2,654 Updated 4 months ago
bigscience-workshop / promptsource
Toolkit for creating, sharing and using natural language prompts.
☆2,960 Updated 2 years ago
gkamradt / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
☆2,060 Updated last year
allenai / open-instruct
AllenAI's post-training codebase
☆3,280 Updated this week
FranxYao / chain-of-thought-hub
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
☆2,751 Updated last year
openai / prm800k
800,000 step-level correctness labels on LLM solutions to MATH problems
☆2,061 Updated 2 years ago
MLGroupJLU / LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
☆1,575 Updated 5 months ago
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,623 Updated last year
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆10,488 Updated this week
sylinrl / TruthfulQA
TruthfulQA: Measuring How Models Imitate Human Falsehoods
☆830 Updated 9 months ago
XueFuzhao / OpenMoE
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
☆1,620 Updated last year
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆2,699 Updated 2 weeks ago
AGI-Edgerunners / LLM-Adapters
Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
☆1,203 Updated last year
FasterDecoding / Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,650 Updated last year
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,274 Updated 2 months ago
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,412 Updated last month
eric-mitchell / direct-preference-optimization
Reference implementation for DPO (Direct Preference Optimization)
☆2,767 Updated last year
huggingface / evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
☆2,355 Updated last month
yizhongw / self-instruct
Aligning pretrained language models with instruction data generated by themselves.
☆4,507 Updated 2 years ago
S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,864 Updated last year
bigcode-project / bigcode-evaluation-harness
A framework for the evaluation of autoregressive code generation language models.
☆986 Updated 3 months ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆2,912 Updated this week