stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,499 · Updated last week
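For orientation, here is a minimal sketch of kicking off a small evaluation with HELM's documented command-line entry points (`helm-run`, `helm-summarize`), driven from Python via `subprocess`. The specific run entry, suite name, and flag names are illustrative assumptions and may differ across HELM versions.

```python
# Minimal sketch: launch a small HELM evaluation from Python by shelling out to
# the helm-run / helm-summarize CLIs installed with `pip install crfm-helm`.
# The run entry, suite name, and instance cap are illustrative, not prescriptive.
import subprocess

suite = "my-suite"

# Run one scenario (MMLU, philosophy subject) against one model, capped at 10 instances.
subprocess.run(
    [
        "helm-run",
        "--run-entries", "mmlu:subject=philosophy,model=openai/gpt2",
        "--suite", suite,
        "--max-eval-instances", "10",
    ],
    check=True,
)

# Aggregate the raw run outputs into summary tables for the results viewer.
subprocess.run(["helm-summarize", "--suite", suite], check=True)
```

The aggregated results can then be browsed locally with the `helm-server` entry point.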
Alternatives and similar repositories for helm
Users interested in helm are comparing it to the libraries listed below.
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,503 · Updated 2 years ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,870 · Updated 2 months ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,787 · Updated 3 months ago
- ☆1,548 · Updated last month
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆2,050 · Updated last year
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,127 · Updated last year
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆815 · Updated 8 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,987 · Updated this week
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,566 · Updated 4 months ago
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆4,147 · Updated 3 months ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,628 · Updated 4 months ago
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets (see the usage sketch after this list). ☆2,335 · Updated 2 weeks ago
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,323 · Updated 2 weeks ago
- YaRN: Efficient Context Window Extension of Large Language Models ☆1,615 · Updated last year
- Toolkit for creating, sharing and using natural language prompts. ☆2,945 · Updated last year
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,748 · Updated last year
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,608 · Updated last year
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,854 · Updated last year
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆2,895 · Updated last week
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,054 · Updated 2 years ago
- AllenAI's post-training codebase ☆3,232 · Updated this week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,663 · Updated this week
- Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models" ☆1,200 · Updated last year
- ☆1,048 · Updated last year
- MTEB: Massive Text Embedding Benchmark ☆2,888 · Updated this week
- A framework for few-shot evaluation of language models. ☆10,303 · Updated this week
- Stanford NLP Python library for Representation Finetuning (ReFT) ☆1,511 · Updated 8 months ago
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆888 · Updated 2 weeks ago
- Accessible large language models via k-bit quantization for PyTorch. ☆7,647 · Updated last week
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. ☆780 · Updated last year
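As a point of comparison with HELM's scenario-level runs, the 🤗 Evaluate entry above exposes individual metrics programmatically. A minimal sketch (the metric name and toy labels are arbitrary examples, not tied to any specific benchmark):

```python
# Minimal sketch of the 🤗 Evaluate metric API: load a metric by name and
# compute it over predictions and references. The toy labels are arbitrary.
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}
```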