terryyz / llm-benchmark
A list of LLM benchmark frameworks.
☆68 · Updated last year
Alternatives and similar repositories for llm-benchmark
Users who are interested in llm-benchmark are comparing it to the repositories listed below.
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆128 · Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆79 · Updated 9 months ago
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆171 · Updated last year
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆58 · Updated last year
- ☆84 · Updated last year
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆98 · Updated 8 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆138 · Updated 8 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆117 · Updated last year
- RepoQA: Evaluating Long-Context Code Understanding ☆111 · Updated 8 months ago
- ☆104 · Updated 2 months ago
- Open Implementations of LLM Analyses ☆105 · Updated 9 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆242 · Updated 8 months ago
- Official repo of Rephrase-and-Respond: data, code, and evaluation ☆103 · Updated 11 months ago
- Evaluating LLMs with CommonGen-Lite ☆90 · Updated last year
- ☆95 · Updated 9 months ago
- Self-Reflection in LLM Agents: Effects on Problem-Solving Performance ☆77 · Updated 7 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆149 · Updated 9 months ago
- Spherical Merge Pytorch/HF format Language Models with minimal feature loss. ☆132 · Updated last year
- Small and Efficient Mathematical Reasoning LLMs ☆71 · Updated last year
- Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models ☆97 · Updated last year
- Code for EMNLP 2024 paper "Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning" ☆55 · Updated 9 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆229 · Updated 8 months ago
- FuseAI Project ☆87 · Updated 5 months ago
- ☆149 · Updated last year
- evol augment any dataset online ☆59 · Updated last year
- Evaluating LLMs with fewer examples ☆160 · Updated last year
- ☆78 · Updated last year
- Data preparation code for Amber 7B LLM ☆91 · Updated last year
- "Improving Mathematical Reasoning with Process Supervision" by OPENAI ☆111 · Updated this week
- [NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`. ☆150 · Updated last year