lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆640 Updated this week
Related projects
Alternatives and complementary repositories for arena-hard-auto
- LiveBench: A Challenging, Contamination-Free LLM Benchmark ☆285 Updated last week
- Code for Quiet-STaR ☆641 Updated 2 months ago
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆435 Updated 7 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆788 Updated this week
- The official evaluation suite and dynamic data release for MixEval. ☆222 Updated this week
- Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality s… ☆480 Updated last week
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆292 Updated 10 months ago
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆739 Updated last week
- Official repository for ORPO ☆419 Updated 5 months ago
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆795 Updated 2 months ago
- Inference code for Mistral and Mixtral hacked up into original Llama implementation ☆373 Updated 11 months ago
- Automatically evaluate your LLMs in Google Colab ☆557 Updated 6 months ago
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ☆644 Updated last month
- Generative Representational Instruction Tuning ☆562 Updated this week
- Large Reasoning Models ☆492 Updated this week
- This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? ☆698 Updated 2 weeks ago
- [ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning ☆610 Updated 5 months ago
- An Open Source Toolkit For LLM Distillation ☆352 Updated last month
- A reading list on LLM based Synthetic Data Generation 🔥 ☆765 Updated last week
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆334 Updated 2 months ago
- A library for easily merging multiple LLM experts and efficiently training the merged LLM. ☆402 Updated 2 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆201 Updated last month
- RewardBench: the first evaluation tool for reward models. ☆426 Updated 2 weeks ago
- Generate textbook-quality synthetic LLM pretraining data ☆488 Updated last year