lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆978 · Updated 6 months ago
Alternatives and similar repositories for arena-hard-auto
Users interested in arena-hard-auto are comparing it to the repositories listed below.
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆758 · Updated 5 months ago
- Automatic evals for LLMs ☆574 · Updated 3 weeks ago
- ☆557 · Updated last year
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆817 · Updated 9 months ago
- RewardBench: the first evaluation tool for reward models. ☆675 · Updated 7 months ago
- Code for Quiet-STaR ☆742 · Updated last year
- Recipes to scale inference-time compute of open models ☆1,123 · Updated 7 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆322 · Updated last month
- ☆1,344 · Updated last year
- ☆1,032 · Updated last year
- ☆968 · Updated 11 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,233 · Updated last week
- Official repository for ORPO ☆468 · Updated last year
- ☆1,067 · Updated this week
- Large Reasoning Models ☆804 · Updated last year
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆1,005 · Updated this week
- An Open Large Reasoning Model for Real-World Solutions ☆1,535 · Updated 7 months ago
- [NeurIPS'25] Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆654 · Updated 9 months ago
- [NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward ☆937 · Updated 10 months ago
- Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas" ☆1,437 · Updated 10 months ago
- Code and Data for Tau-Bench ☆1,048 · Updated 4 months ago
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ☆751 · Updated last year
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,933 · Updated 5 months ago
- [COLM 2025] LIMO: Less is More for Reasoning ☆1,061 · Updated 5 months ago
- FuseAI Project ☆585 · Updated 11 months ago
- OLMoE: Open Mixture-of-Experts Language Models ☆945 · Updated 3 months ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆455 · Updated last year
- Scalable RL solution for advanced reasoning of language models ☆1,794 · Updated 9 months ago
- The official implementation of Self-Play Fine-Tuning (SPIN) ☆1,230 · Updated last year
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆541 · Updated last year