leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
☆550 · Updated last year
Alternatives and similar repositories for llm_benchmarks
Users interested in llm_benchmarks are comparing it to the libraries listed below:
- The papers are organized according to our survey "Evaluating Large Language Models: A Comprehensive Survey". ☆792 · Updated last year
- Chat Templates for 🤗 HuggingFace Large Language Models ☆713 · Updated last year
- List of papers on hallucination detection in LLMs. ☆1,041 · Updated last month
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆826 · Updated 10 months ago
- ☆519 · Updated 6 months ago
- Automatic evals for LLMs ☆579 · Updated last month
- A reading list on LLM-based Synthetic Data Generation 🔥 ☆1,516 · Updated 8 months ago
- Evaluate your LLM's response with Prometheus and GPT-4 💯 ☆1,043 · Updated 9 months ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆601 · Updated last year
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆366 · Updated last year
- Arena-Hard-Auto: An automatic LLM benchmark. ☆994 · Updated 7 months ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,940 · Updated 6 months ago
- ☆564 · Updated last year
- Best practices for distilling large language models. ☆604 · Updated 2 years ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆335 · Updated 2 months ago
- A curated list of retrieval-augmented generation (RAG) in large language models ☆365 · Updated 2 months ago
- A curated list of papers related to constrained decoding of LLMs, along with their relevant code and resources. ☆320 · Updated 2 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,293 · Updated 3 weeks ago
- RewardBench: the first evaluation tool for reward models. ☆687 · Updated last week
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆373 · Updated 2 years ago
- A collection of awesome prompt datasets and instruction datasets for training chat LLMs such as ChatGPT; gathers a wide variety of instruction datasets used to train ChatLLM models. ☆721 · Updated last year
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆552 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆947 · Updated last year
- Official repository for ORPO ☆469 · Updated last year
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,550 · Updated 2 years ago
- An Open Source Toolkit For LLM Distillation ☆859 · Updated last month
- Generative Representational Instruction Tuning ☆686 · Updated 7 months ago
- This is a collection of research papers for Self-Correcting Large Language Models with Automated Feedback. ☆565 · Updated last year
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆466 · Updated last year
- ☆580 · Updated 2 years ago