terryyz / llm-benchmark
A list of LLM benchmark frameworks.
☆66 Updated last year
Alternatives and similar repositories for llm-benchmark
Users who are interested in llm-benchmark are comparing it to the libraries listed below.
- Spherical Merge Pytorch/HF format Language Models with minimal feature loss. ☆123 Updated last year
- Evaluating LLMs with fewer examples ☆155 Updated last year
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆247 Updated 3 months ago
- Simple extension on vLLM to help you speed up reasoning model without training. ☆152 Updated this week
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆170 Updated last year
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025] ☆106 Updated 3 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆140 Updated 7 months ago
- ☆197 Updated 5 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 Updated 11 months ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆58 Updated last year
- Verifiers for LLM Reinforcement Learning ☆55 Updated last month
- Benchmarking LLMs with Challenging Tasks from Real Users ☆223 Updated 7 months ago
- ☆84 Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆78 Updated 8 months ago
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆105 Updated this week
- evol augment any dataset online ☆59 Updated last year
- A pipeline for LLM knowledge distillation ☆104 Updated 2 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆143 Updated 8 months ago
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆121 Updated 11 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆104 Updated 5 months ago
- ☆76 Updated last year
- ☆95 Updated 8 months ago
- Evaluating LLMs with CommonGen-Lite ☆90 Updated last year
- ☆309 Updated 11 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆142 Updated 7 months ago
- ☆109 Updated 2 months ago
- Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators" ☆131 Updated last year
- ☆83 Updated last year
- Code accompanying "How I learned to start worrying about prompt formatting". ☆105 Updated 8 months ago
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆241 Updated last year