terryyz / llm-benchmark
A list of LLM benchmark frameworks.
☆64 Updated last year
Alternatives and similar repositories for llm-benchmark:
Users who are interested in llm-benchmark are comparing it to the libraries listed below.
- Benchmarking LLMs with Challenging Tasks from Real Users ☆218 Updated 4 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆131 Updated 5 months ago
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆108 Updated 9 months ago
- Data preparation code for Amber 7B LLM ☆86 Updated 10 months ago
- Benchmark baseline for retrieval QA applications ☆103 Updated 11 months ago
- ☆74 Updated last year
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆201 Updated 2 weeks ago
- Experiments on speculative sampling with Llama models ☆125 Updated last year
- Just a bunch of benchmark logs for different LLMs ☆119 Updated 7 months ago
- RepoQA: Evaluating Long-Context Code Understanding ☆105 Updated 4 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆92 Updated last year
- Self-Reflection in LLM Agents: Effects on Problem-Solving Performance ☆61 Updated 3 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆75 Updated 5 months ago
- ☆214 Updated 7 months ago
- ☆78 Updated 3 weeks ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Length (ICLR 2024) ☆205 Updated 9 months ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆57 Updated 11 months ago
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆156 Updated last year
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆136 Updated 4 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆114 Updated 9 months ago
- Open Implementations of LLM Analyses ☆102 Updated 5 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- Evaluating LLMs with CommonGen-Lite ☆89 Updated 11 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆130 Updated 4 months ago
- A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages. ☆50 Updated 4 months ago
- ToolBench, an evaluation suite for LLM tool manipulation capabilities. ☆150 Updated last year
- ☆306 Updated 9 months ago
- A pipeline for LLM knowledge distillation ☆96 Updated last month
- "Improving Mathematical Reasoning with Process Supervision" by OpenAI ☆107 Updated last week
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR 2025] ☆102 Updated 3 weeks ago