terryyz / llm-benchmark
A list of LLM benchmark frameworks.
☆66 · Updated last year
Alternatives and similar repositories for llm-benchmark:
Users interested in llm-benchmark are comparing it to the repositories listed below.
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆133 · Updated 5 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆135 · Updated 6 months ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆57 · Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆77 · Updated 6 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆137 · Updated 5 months ago
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets. ☆73 · Updated 5 months ago
- Evaluating LLMs with fewer examples ☆150 · Updated last year
- RepoQA: Evaluating Long-Context Code Understanding ☆107 · Updated 5 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated last year
- Benchmarking LLMs with Challenging Tasks from Real Users ☆221 · Updated 5 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆116 · Updated 10 months ago
- Official implementation for "Extending LLMs' Context Window with 100 Samples" ☆76 · Updated last year
- Self-Reflection in LLM Agents: Effects on Problem-Solving Performance ☆66 · Updated 4 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆106 · Updated 7 months ago
- Repository for analysis and experiments in the BigCode project. ☆117 · Updated last year
- Code and data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR 2025] ☆105 · Updated last month
- Experiments on speculative sampling with Llama models ☆125 · Updated last year
- Benchmark baseline for retrieval QA applications ☆108 · Updated last year
- Evaluating LLMs with CommonGen-Lite ☆89 · Updated last year
- ToolBench, an evaluation suite for LLM tool manipulation capabilities. ☆150 · Updated last year
- Just a bunch of benchmark logs for different LLMs ☆119 · Updated 8 months ago
- Small and Efficient Mathematical Reasoning LLMs ☆71 · Updated last year
- 🚢 Data Toolkit for Sailor Language Models ☆88 · Updated last month
- A distributed, extensible, secure solution for evaluating machine-generated code with unit tests in multiple programming languages. ☆52 · Updated 5 months ago
- The code for the paper "RouterBench: A Benchmark for Multi-LLM Routing System" ☆116 · Updated 10 months ago
- Code accompanying "How I learned to start worrying about prompt formatting". ☆105 · Updated 6 months ago