lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆978 · Updated 6 months ago
Alternatives and similar repositories for arena-hard-auto
Users interested in arena-hard-auto are comparing it to the repositories listed below.
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆758 · Updated 5 months ago
- Automatic evals for LLMs ☆574 · Updated 3 weeks ago
- ☆557 · Updated last year
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆817 · Updated 9 months ago
- RewardBench: the first evaluation tool for reward models. ☆675 · Updated 7 months ago
- Code for Quiet-STaR ☆742 · Updated last year
- Recipes to scale inference-time compute of open models ☆1,123 · Updated 7 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆322 · Updated last month
- ☆1,344 · Updated last year
- ☆1,032 · Updated last year
- ☆968 · Updated 11 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,233 · Updated last week
- Official repository for ORPO ☆468 · Updated last year
- ☆1,067 · Updated this week
- Large Reasoning Models ☆804 · Updated last year
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆1,005 · Updated this week
- An Open Large Reasoning Model for Real-World Solutions ☆1,535 · Updated 7 months ago
- [NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆654 · Updated 9 months ago
- [NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward ☆937 · Updated 10 months ago
- Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas" ☆1,437 · Updated 10 months ago
- Code and Data for Tau-Bench ☆1,048 · Updated 4 months ago
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ☆751 · Updated last year
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,933 · Updated 5 months ago
- [COLM 2025] LIMO: Less is More for Reasoning ☆1,061 · Updated 5 months ago
- FuseAI Project ☆585 · Updated 11 months ago
- OLMoE: Open Mixture-of-Experts Language Models ☆945 · Updated 3 months ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆455 · Updated last year
- Scalable RL solution for advanced reasoning of language models ☆1,794 · Updated 9 months ago
- The official implementation of Self-Play Fine-Tuning (SPIN) ☆1,230 · Updated last year
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆541 · Updated last year