leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
☆525 · Updated last year
Alternatives and similar repositories for llm_benchmarks
Users interested in llm_benchmarks are comparing it to the libraries listed below.
- The papers are organized according to our survey "Evaluating Large Language Models: A Comprehensive Survey". ☆785 · Updated last year
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆786 · Updated 7 months ago
- ☆456 · Updated 3 months ago
- Chat Templates for 🤗 HuggingFace Large Language Models. ☆705 · Updated 11 months ago
- Automatic evals for LLMs. ☆556 · Updated 4 months ago
- A reading list on LLM-based Synthetic Data Generation 🔥. ☆1,452 · Updated 5 months ago
- A collection of awesome prompt and instruction datasets for training chat LLMs such as ChatGPT. Collects a wide variety of instruction datasets for training ChatLLM models. ☆709 · Updated last year
- List of papers on hallucination detection in LLMs. ☆981 · Updated 2 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends. ☆2,078 · Updated last week
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,896 · Updated 3 months ago
- Evaluate your LLM's responses with Prometheus and GPT-4 💯. ☆1,009 · Updated 6 months ago
- Generative Representational Instruction Tuning. ☆677 · Updated 4 months ago
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models. ☆606 · Updated 4 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]. ☆307 · Updated 8 months ago
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. ☆572 · Updated last year
- An implementation of the paper "Searching for Best Practices in Retrieval-Augmented Generation" (EMNLP 2024). ☆340 · Updated 10 months ago
- LongBench v2 and LongBench (ACL '25 & '24). ☆1,008 · Updated 9 months ago
- A collection of 150+ surveys on LLMs. ☆337 · Updated 8 months ago
- Awesome-LLM-Prompt-Optimization: a curated list of advanced prompt optimization and tuning methods in Large Language Models. ☆381 · Updated last year
- Summarizes existing representative LLM text datasets. ☆1,380 · Updated last month
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆523 · Updated last year
- ☆552 · Updated 11 months ago
- ☆615 · Updated 3 months ago
- RewardBench: the first evaluation tool for reward models. ☆649 · Updated 5 months ago
- Official repository for ORPO. ☆463 · Updated last year
- Repository for "MultiHop-RAG: A Dataset for Evaluating Retrieval-Augmented Generation Across Documents" (COLM 2024). ☆384 · Updated 7 months ago
- Automatically evaluate your LLMs in Google Colab. ☆664 · Updated last year
- A curated list of retrieval-augmented generation (RAG) resources for large language models. ☆328 · Updated 8 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆956 · Updated 4 months ago
- Aligning Large Language Models with Human: A Survey. ☆738 · Updated 2 years ago