mlfoundations / evalchemy
Automatic evals for LLMs
☆519 · Updated last month
Alternatives and similar repositories for evalchemy
Users who are interested in evalchemy are comparing it to the libraries listed below.
- Reproducible, flexible LLM evaluations · ☆237 · Updated last month
- ☆536 · Updated 9 months ago
- A simple unified framework for evaluating LLMs · ☆240 · Updated 4 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning · ☆360 · Updated 11 months ago
- Recipes to scale inference-time compute of open models · ☆1,112 · Updated 3 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc. · ☆433 · Updated last week
- SkyRL: A Modular Full-stack RL Library for LLMs · ☆738 · Updated this week
- Official repository for ORPO · ☆462 · Updated last year
- RewardBench: the first evaluation tool for reward models. · ☆624 · Updated 2 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] · ☆522 · Updated 3 weeks ago
- A project to improve skills of large language models · ☆529 · Updated this week
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" · ☆591 · Updated 5 months ago
- ☆893 · Updated last month
- The official evaluation suite and dynamic data release for MixEval. · ☆244 · Updated 9 months ago
- ☆621 · Updated last month
- Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" · ☆254 · Updated 3 months ago
- An Open Source Toolkit For LLM Distillation · ☆712 · Updated last month
- Code and Data for Tau-Bench · ☆779 · Updated last month
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context · ☆470 · Updated last year
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … · ☆756 · Updated 5 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users · ☆237 · Updated 9 months ago
- Scaling Data for SWE-agents · ☆371 · Updated this week
- procedural reasoning datasets · ☆1,060 · Updated last week
- PyTorch building blocks for the OLMo ecosystem · ☆274 · Updated this week
- [COLM 2025] LIMO: Less is More for Reasoning · ☆1,006 · Updated 3 weeks ago
- LOFT: A 1 Million+ Token Long-Context Benchmark · ☆209 · Updated 2 months ago
- Code for Quiet-STaR · ☆737 · Updated last year
- xLAM: A Family of Large Action Models to Empower AI Agent Systems · ☆537 · Updated this week
- ☆1,033 · Updated 8 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" · ☆308 · Updated last year