terryyz / llm-benchmark
A list of LLM benchmark frameworks.
☆68 Updated last year
Alternatives and similar repositories for llm-benchmark
Users who are interested in llm-benchmark are comparing it to the libraries listed below:
- Open Implementations of LLM Analyses ☆105 Updated 10 months ago
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆174 Updated last year
- Benchmark baseline for retrieval QA applications ☆115 Updated last year
- RepoQA: Evaluating Long-Context Code Understanding ☆113 Updated 9 months ago
- ☆95 Updated 10 months ago
- Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models ☆97 Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆79 Updated 10 months ago
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆99 Updated last week
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆132 Updated last year
- Verifiers for LLM Reinforcement Learning ☆69 Updated 3 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆115 Updated 11 months ago
- awesome llm plaza: daily tracking all sorts of awesome topics of llm, e.g. llm for coding, robotics, reasoning, multimodality, etc. ☆206 Updated 2 weeks ago
- Evaluating LLMs with CommonGen-Lite ☆90 Updated last year
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆152 Updated 10 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆234 Updated 9 months ago
- ☆109 Updated 3 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆139 Updated 9 months ago
- [ACL'25 Findings] SWE-Dev is an SWE agent with a scalable test case construction pipeline. ☆53 Updated 3 weeks ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models ☆59 Updated last year
- r2e: turn any github repository into a programming agent environment ☆129 Updated 3 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆242 Updated 9 months ago
- Evaluating LLMs with fewer examples ☆160 Updated last year
- Official implementation of paper "On the Diagram of Thought" (https://arxiv.org/abs/2409.10038) ☆184 Updated 4 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- ☆77 Updated last year
- Data preparation code for Amber 7B LLM ☆91 Updated last year
- Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators" ☆135 Updated last year
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆129 Updated 11 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆118 Updated last year
- Official repo of Respond-and-Respond: data, code, and evaluation ☆103 Updated last year