mlfoundations / evalchemy
Automatic evals for LLMs
☆376 Updated this week
Alternatives and similar repositories for evalchemy:
Users who are interested in evalchemy are comparing it to the libraries listed below.
- ☆515 Updated 5 months ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆693 Updated last month
- Official repository for ORPO ☆450 Updated 11 months ago
- ☆671 Updated last week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,500 Updated this week
- Reproducible, flexible LLM evaluations ☆198 Updated last month
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆354 Updated 8 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆238 Updated 5 months ago
- RewardBench: the first evaluation tool for reward models. ☆562 Updated this week
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆459 Updated last year
- An Open Source Toolkit For LLM Distillation ☆586 Updated last week
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆512 Updated last month
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆338 Updated this week
- A project to improve skills of large language models ☆354 Updated this week
- ☆287 Updated last month
- ☆1,017 Updated 4 months ago
- A simple unified framework for evaluating LLMs ☆209 Updated 3 weeks ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆292 Updated this week
- OLMoE: Open Mixture-of-Experts Language Models ☆739 Updated last month
- awesome synthetic (text) datasets ☆278 Updated 6 months ago
- Recipes to scale inference-time compute of open models ☆1,066 Updated 2 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆221 Updated 6 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆301 Updated last year
- ☆524 Updated 3 weeks ago
- Generative Representational Instruction Tuning ☆626 Updated last month
- Verifiers for LLM Reinforcement Learning ☆881 Updated last month
- ☆924 Updated 3 months ago
- Large Reasoning Models ☆804 Updated 5 months ago
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆409 Updated last year
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ☆720 Updated 7 months ago