stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open-source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible, and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models.
☆2,604 · Updated last week
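For context on how HELM is typically used, the sketch below shows a minimal evaluation run with the `helm-run` and `helm-summarize` commands from the `crfm-helm` package. The run-entry string and flag names are assumptions based on recent releases (older versions use `--run-specs` instead of `--run-entries`), so treat this as an illustrative sketch rather than the canonical invocation.

```shell
# Install the HELM package (assumption: a recent crfm-helm release on PyPI).
pip install crfm-helm

# Run a small evaluation on one scenario; the run-entry syntax and model name
# shown here are illustrative and may differ between HELM versions.
helm-run \
  --run-entries "mmlu:subject=philosophy,model=openai/gpt2" \
  --suite my-suite \
  --max-eval-instances 10

# Aggregate the run outputs for the suite into summary tables.
helm-summarize --suite my-suite
```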
Alternatives and similar repositories for helm
Users interested in helm are comparing it to the libraries listed below.
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,932 · Updated 4 months ago
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,537 · Updated 2 years ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆2,126 · Updated last year
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,809 · Updated 6 months ago
- General technology for enabling AI capabilities w/ LLMs and MLLMs ☆4,238 · Updated 2 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,226 · Updated 3 weeks ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,702 · Updated last month
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,384 · Updated last month
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,763 · Updated last year
- ☆1,556 · Updated 2 weeks ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,795 · Updated last week
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,648 · Updated last year
- Toolkit for creating, sharing and using natural language prompts. ☆2,987 · Updated 2 years ago
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,586 · Updated 7 months ago
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆3,008 · Updated last week
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆859 · Updated 11 months ago
- YaRN: Efficient Context Window Extension of Large Language Models ☆1,654 · Updated last year
- AllenAI's post-training codebase ☆3,488 · Updated this week
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,082 · Updated 2 years ago
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters ☆1,884 · Updated last year
- Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models ☆3,182 · Updated last year
- Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models" ☆1,222 · Updated last year
- Stanford NLP Python library for Representation Finetuning (ReFT) ☆1,548 · Updated this week
- 🤗 Evaluate: A library for easily evaluating machine learning models and datasets. ☆2,388 · Updated last month
- A framework for few-shot evaluation of language models. ☆11,069 · Updated last week
- Robust recipes to align language models with human and AI preferences ☆5,466 · Updated 3 months ago
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,085 · Updated 6 months ago
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆898 · Updated 3 months ago
- Minimalistic large language model 3D-parallelism training ☆2,396 · Updated 3 weeks ago
- [ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the dive… ☆974 · Updated last year