terryyz / llm-benchmark
A list of LLM benchmark frameworks.
☆60 · Updated 9 months ago
Related projects
Alternatives and complementary repositories for llm-benchmark
- Codebase accompanying the Summary of a Haystack paper. ☆72 · Updated 2 months ago
- Just a bunch of benchmark logs for different LLMs ☆115 · Updated 3 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆198 · Updated 2 weeks ago
- A pipeline for LLM knowledge distillation ☆78 · Updated 3 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆82 · Updated 2 months ago
- 🚢 Data Toolkit for Sailor Language Models ☆82 · Updated 4 months ago
- Expert Specialized Fine-Tuning ☆148 · Updated 2 months ago
- Code accompanying "How I learned to start worrying about prompt formatting". ☆95 · Updated last month
- Scalable Meta-Evaluation of LLMs as Evaluators ☆41 · Updated 9 months ago
- ☆73 · Updated 10 months ago
- Advanced Reasoning Benchmark Dataset for LLMs ☆45 · Updated last year
- Evaluating LLMs with fewer examples ☆135 · Updated 7 months ago
- Small and Efficient Mathematical Reasoning LLMs ☆71 · Updated 9 months ago
- ☆112 · Updated last month
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆124 · Updated 3 weeks ago
- ☆35 · Updated last year
- ☆72 · Updated last year
- Data preparation code for Amber 7B LLM ☆83 · Updated 6 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆131 · Updated last week
- ☆83 · Updated last year
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [EMNLP 2024] ☆104 · Updated last month
- "Improving Mathematical Reasoning with Process Supervision" by OPENAI ☆83 · Updated last week
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆129 · Updated 2 months ago
- A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs. ☆76 · Updated 9 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆115 · Updated 2 weeks ago
- Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models ☆87 · Updated last year
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆91 · Updated 4 months ago
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆146 · Updated 8 months ago
- Official repo of Respond-and-Respond: data, code, and evaluation ☆98 · Updated 3 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆115 · Updated last month