terryyz / llm-benchmark
A list of LLM benchmark frameworks.
☆67 · Updated last year
Alternatives and similar repositories for llm-benchmark
Users interested in llm-benchmark are comparing it to the libraries listed below.
- Benchmarking LLMs with Challenging Tasks from Real Users ☆226 · Updated 7 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆137 · Updated 7 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆78 · Updated 9 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆117 · Updated last year
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025] ☆105 · Updated 4 months ago
- Spherical Merge Pytorch/HF format Language Models with minimal feature loss. ☆129 · Updated last year
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆145 · Updated 8 months ago
- The first dense retrieval model that can be prompted like an LM ☆73 · Updated last month
- Benchmark baseline for retrieval QA applications ☆115 · Updated last year
- Evaluating LLMs with CommonGen-Lite ☆90 · Updated last year
- Official repo of Rephrase-and-Respond: data, code, and evaluation ☆104 · Updated 10 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 · Updated 10 months ago
- Official implementation for 'Extending LLMs' Context Window with 100 Samples' ☆78 · Updated last year
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆171 · Updated last year
- Small and Efficient Mathematical Reasoning LLMs ☆71 · Updated last year
- ☆76 · Updated last year
- This is the repo for the paper Shepherd -- A Critic for Language Model Generation ☆219 · Updated last year
- ☆126 · Updated last year
- [EMNLP 2023 Industry Track] A simple prompting approach that enables the LLMs to run inference in batches. ☆74 · Updated last year
- RepoQA: Evaluating Long-Context Code Understanding ☆109 · Updated 7 months ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆58 · Updated last year
- Mixing Language Models with Self-Verification and Meta-Verification ☆104 · Updated 6 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆254 · Updated 3 months ago
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆124 · Updated last year
- ☆84 · Updated last year
- Evaluating LLMs with fewer examples ☆158 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆95 · Updated last year
- ☆95 · Updated 8 months ago
- A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs. ☆85 · Updated last year
- ☆117 · Updated 3 months ago