stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for the holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,432 · Updated this week
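For context on how the framework is driven, the sketch below shows one way to launch a small HELM evaluation from Python by shelling out to the `helm-run` and `helm-summarize` command-line entry points described in the project's quick-start documentation. This is a minimal, illustrative sketch, not the canonical workflow: it assumes the `crfm-helm` package is installed (`pip install crfm-helm`), and the run-entry string, suite name, and flag names are taken from the quick-start docs and may differ between HELM versions.

```python
# Minimal sketch: run a tiny HELM evaluation via its CLI entry points.
# Assumptions: `pip install crfm-helm` has been run, so helm-run and
# helm-summarize are on PATH; the run-entry syntax and flags follow the
# quick-start docs and may vary by HELM version.
import subprocess

# Evaluate one model on a single MMLU subject, capped at 10 instances to keep the run cheap.
subprocess.run(
    [
        "helm-run",
        "--run-entries", "mmlu:subject=anatomy,model=openai/gpt2",
        "--suite", "my-suite",
        "--max-eval-instances", "10",
    ],
    check=True,
)

# Aggregate the raw run outputs for the suite into summary tables.
subprocess.run(["helm-summarize", "--suite", "my-suite"], check=True)
```

Per the HELM documentation, the summarized results can then be browsed in a local web UI served by `helm-server`; treat the exact arguments above as version-dependent.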
Alternatives and similar repositories for helm
Users interested in helm are comparing it to the libraries listed below.
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. (☆1,840, updated 3 weeks ago)
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (☆1,777, updated 2 months ago)
- Measuring Massive Multitask Language Understanding | ICLR 2021 (☆1,483, updated 2 years ago)
- General technology for enabling AI capabilities w/ LLMs and MLLMs (☆4,096, updated 2 months ago)
- Doing simple retrieval from LLMs at various context lengths to measure accuracy (☆1,993, updated last year)
- The hub for EleutherAI's work on interpretability and learning dynamics (☆2,602, updated 2 months ago)
- Data and tools for generating and inspecting OLMo pre-training data. (☆1,303, updated last week)
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting (☆2,748, updated last year)
- ☆1,538, updated last week
- YaRN: Efficient Context Window Extension of Large Language Models (☆1,589, updated last year)
- 800,000 step-level correctness labels on LLM solutions to MATH problems (☆2,041, updated 2 years ago)
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models (☆3,108, updated last year)
- Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models" (☆1,196, updated last year)
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models (☆1,583, updated last year)
- TruthfulQA: Measuring How Models Imitate Human Falsehoods (☆796, updated 7 months ago)
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". (☆1,556, updated 2 months ago)
- Toolkit for creating, sharing and using natural language prompts. (☆2,923, updated last year)
- AllenAI's post-training codebase (☆3,144, updated this week)
- Code for the paper "Evaluating Large Language Models Trained on Code" (☆2,901, updated 7 months ago)
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends (☆1,851, updated this week)
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters (☆1,852, updated last year)
- A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca) (☆1,128, updated last year)
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. (☆2,558, updated 2 weeks ago)
- A modular RL library to fine-tune language models to human preferences (☆2,346, updated last year)
- A framework for the evaluation of autoregressive code generation language models. (☆975, updated last month)
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) (☆2,765, updated 7 months ago)
- [ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the dive… (☆958, updated 10 months ago)
- Reference implementation for DPO (Direct Preference Optimization) (☆2,717, updated last year)
- MTEB: Massive Text Embedding Benchmark (☆2,785, updated last week)
- Original Implementation of Prompt Tuning from Lester et al., 2021 (☆693, updated 5 months ago)