mlfoundations / evalchemy

Automatic evals for LLMs

☆488

Alternatives and similar repositories for evalchemy

Users who are interested in evalchemy are comparing it to the libraries listed below.

allenai / olmes
Reproducible, flexible LLM evaluations
☆226 Updated 3 weeks ago
huggingface / cosmopedia
☆525 Updated 8 months ago
tianyi-lab / Reflection_Tuning
[ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
☆359 Updated 10 months ago
NovaSky-AI / SkyRL
SkyRL: A Modular Full-stack RL Library for LLMs
☆651 Updated this week
NVIDIA / NeMo-Skills
A project to improve skills of large language models
☆490 Updated this week
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆229 Updated 3 months ago
sail-sg / oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
☆418 Updated this week
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆242 Updated 8 months ago
knoveleng / open-rs
Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't"
☆245 Updated 2 months ago
huggingface / search-and-learn
Recipes to scale inference-time compute of open models
☆1,110 Updated 2 months ago
xfactlab / orpo
Official repository for ORPO
☆460 Updated last year
microsoft / rStar
☆604 Updated 2 weeks ago
allenai / reward-bench
RewardBench: the first evaluation tool for reward models.
☆619 Updated last month
allenai / OLMo-core
PyTorch building blocks for the OLMo ecosystem
☆267 Updated this week
huggingface / Math-Verify
☆857 Updated 3 weeks ago
arcee-ai / DistillKit
An Open Source Toolkit For LLM Distillation
☆698 Updated 3 weeks ago
facebookresearch / swe-rl
Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆571 Updated 4 months ago
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆233 Updated 8 months ago
FranxYao / Long-Context-Data-Engineering
Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context"
☆468 Updated last year
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆513 Updated this week
TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
☆264 Updated 5 months ago
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆736 Updated 4 months ago
facebookresearch / sweet_rl
Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks"
☆231 Updated 2 months ago
allenai / OLMoE
OLMoE: Open Mixture-of-Experts Language Models
☆823 Updated 4 months ago
davanstrien / awesome-synthetic-datasets
awesome synthetic (text) datasets
☆290 Updated 3 weeks ago
open-thought / reasoning-gym
procedural reasoning datasets
☆998 Updated this week
lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆884 Updated last month
ezelikman / quiet-star
Code for Quiet-STaR
☆735 Updated 11 months ago
LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆608 Updated 2 weeks ago
Mohammadjafari80 / GSM8K-RLVR
A simplified implementation for experimenting with RLVR on GSM8K. This repository provides a starting point for exploring reasoning.
☆117 Updated 5 months ago