leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
☆517 · Updated last year
Alternatives and similar repositories for llm_benchmarks
Users interested in llm_benchmarks are comparing it to the repositories listed below.
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. ☆782 · Updated last year
- A reading list on LLM-based Synthetic Data Generation 🔥 ☆1,434 · Updated 4 months ago
- ☆447 · Updated 2 months ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆782 · Updated 7 months ago
- Chat Templates for 🤗 HuggingFace Large Language Models ☆704 · Updated 10 months ago
- List of papers on hallucination detection in LLMs. ☆974 · Updated this week
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆569 · Updated last year
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆1,005 · Updated 5 months ago
- A collection of 150+ surveys on LLMs ☆334 · Updated 8 months ago
- Automatic evals for LLMs ☆547 · Updated 3 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆296 · Updated 7 months ago
- A curated list of retrieval-augmented generation (RAG) in large language models ☆321 · Updated 8 months ago
- Awesome-LLM-Prompt-Optimization: a curated list of advanced prompt optimization and tuning methods in Large Language Models ☆374 · Updated last year
- A collection of awesome-prompt-datasets and awesome-instruction-datasets to train ChatLLMs such as ChatGPT; gathers a wide variety of instruction datasets for training ChatLLM models. ☆703 · Updated last year
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,877 · Updated 2 months ago
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆517 · Updated last year
- Summarizes existing representative LLM text datasets. ☆1,365 · Updated last week
- ☆611 · Updated 2 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,009 · Updated this week
- ☆559 · Updated 2 years ago
- ☆544 · Updated 11 months ago
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆600 · Updated 3 months ago
- Aligning Large Language Models with Human: A Survey ☆735 · Updated 2 years ago
- Official repository for ORPO ☆463 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆897 · Updated last year
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆680 · Updated 3 months ago
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs ☆294 · Updated last year
- FuseAI Project ☆584 · Updated 8 months ago
- LongBench v2 and LongBench (ACL '25 & '24) ☆992 · Updated 9 months ago
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,505 · Updated 2 years ago