stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,250 · Updated this week
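For readers who want to try HELM itself, here is a minimal sketch that drives a small evaluation through the `helm-run` and `helm-summarize` entry points installed by the `crfm-helm` package; the run-entry string, the suite name, and the exact flag spellings are assumptions in the style of the project's quickstart and may differ between HELM versions.

```python
# Minimal sketch: launch a tiny HELM evaluation from Python by shelling out to its
# CLI entry points. Assumes `pip install crfm-helm`; the run-entry syntax and flag
# names below are illustrative and may vary between HELM versions.
import subprocess

suite = "my-test-suite"  # hypothetical suite name used to group this run's outputs

# Evaluate a small slice of one scenario (the run-entry string is an example).
subprocess.run(
    [
        "helm-run",
        "--run-entries", "mmlu:subject=philosophy,model=openai/gpt2",
        "--suite", suite,
        "--max-eval-instances", "10",
    ],
    check=True,
)

# Aggregate the raw run outputs into the summary tables read by HELM's web frontend.
subprocess.run(["helm-summarize", "--suite", suite], check=True)
```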
Alternatives and similar repositories for helm
Users interested in helm are comparing it to the libraries listed below.
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,747 · Updated last year
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,424 · Updated 2 years ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,763 · Updated 5 months ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,524 · Updated this week
- A modular RL library to fine-tune language models to human preferences ☆2,309 · Updated last year
- Reference implementation for DPO (Direct Preference Optimization) ☆2,587 · Updated 9 months ago
- ☆1,521 · Updated last month
- Toolkit for creating, sharing and using natural language prompts. ☆2,872 · Updated last year
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,534 · Updated last year
- Aligning pretrained language models with instruction data generated by themselves. ☆4,385 · Updated 2 years ago
- AllenAI's post-training codebase ☆2,993 · Updated this week
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆1,391 · Updated last year
- A framework for few-shot evaluation of language models. ☆9,126 · Updated this week
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆2,724 · Updated this week
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆1,881 · Updated 9 months ago
- A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF) ☆4,659 · Updated last year
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,729 · Updated 10 months ago
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. ☆766 · Updated last year
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,051 · Updated 10 months ago
- Reading list of instruction tuning. A trend starts from Natural-Instruction (ACL 2022), FLAN (ICLR 2022) and T0 (ICLR 2022). ☆769 · Updated last year
- YaRN: Efficient Context Window Extension of Large Language Models ☆1,489 · Updated last year
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,574 · Updated last week
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆4,008 · Updated last week
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,528 · Updated 2 months ago
- Distributed trainer for LLMs ☆575 · Updated last year
- Accessible large language models via k-bit quantization for PyTorch. ☆7,088 · Updated last week
- Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models" ☆1,173 · Updated last year
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆738 · Updated 4 months ago
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆851 · Updated last week
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,001 · Updated 2 years ago