leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
☆466 · Updated 11 months ago
Alternatives and similar repositories for llm_benchmarks
Users interested in llm_benchmarks are comparing it to the libraries listed below.
- A reading list on LLM-based Synthetic Data Generation 🔥 ☆1,310 · Updated 3 weeks ago
- Chat Templates for 🤗 HuggingFace Large Language Models ☆674 · Updated 6 months ago
- Automatic evals for LLMs ☆437 · Updated 3 weeks ago
- List of papers on hallucination detection in LLMs. ☆896 · Updated last week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,641 · Updated this week
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆717 · Updated 3 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆254 · Updated 3 months ago
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆348 · Updated last year
- A collection of 150+ surveys on LLMs ☆301 · Updated 4 months ago
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. ☆772 · Updated last year
- A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval. ☆367 · Updated last year
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,778 · Updated 6 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆357 · Updated 9 months ago
- A curated list of papers related to constrained decoding of LLMs, along with their relevant code and resources. ☆224 · Updated 2 weeks ago
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆952 · Updated 2 months ago
- ☆782 · Updated last month
- Automatically evaluate your LLMs in Google Colab ☆643 · Updated last year
- An Awesome Collection for LLM Survey ☆366 · Updated last month
- LLM hallucination paper list ☆318 · Updated last year
- A framework for the evaluation of autoregressive code generation language models. ☆949 · Updated 7 months ago
- RewardBench: the first evaluation tool for reward models. ☆604 · Updated 2 weeks ago
- ☆363 · Updated 3 weeks ago
- ☆520 · Updated 7 months ago
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆497 · Updated last year
- Implementation of paper "Data Engineering for Scaling Language Models to 128K Context" ☆463 · Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆501 · Updated 5 months ago
- Doing simple retrieval from LLMs at various context lengths to measure accuracy ☆1,897 · Updated 10 months ago
- A library for easily merging multiple LLM experts and efficiently training the merged LLM. ☆483 · Updated 10 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆243 · Updated 7 months ago
- ☆575 · Updated 3 weeks ago