leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
☆486 · Updated last year
Alternatives and similar repositories for llm_benchmarks
Users interested in llm_benchmarks are comparing it to the repositories listed below.
- Chat Templates for 🤗 HuggingFace Large Language Models ☆690 · Updated 7 months ago
- Automatic evals for LLMs ☆496 · Updated last month
- A reading list on LLM-based Synthetic Data Generation 🔥 ☆1,379 · Updated 2 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,793 · Updated this week
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆744 · Updated 4 months ago
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. ☆781 · Updated last year
- ☆400 · Updated 2 weeks ago
- Evaluate your LLM's responses with Prometheus and GPT-4 💯 ☆978 · Updated 3 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆360 · Updated 11 months ago
- A curated list of retrieval-augmented generation (RAG) in large language models ☆295 · Updated 5 months ago
- List of papers on hallucination detection in LLMs. ☆930 · Updated last month
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆265 · Updated 5 months ago
- ☆594 · Updated last week
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆586 · Updated last month
- A collection of awesome prompt datasets and instruction datasets for training ChatLLMs such as ChatGPT (a wide variety of instruction datasets for training ChatLLM models). ☆688 · Updated last year
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models ☆549 · Updated last year
- Generative Representational Instruction Tuning ☆664 · Updated last month
- Codebase for reproducing the experiments of the semantic uncertainty paper (short-phrase and sentence-length experiments). ☆349 · Updated last year
- Official repository for ORPO ☆462 · Updated last year
- This is an implementation of the paper: Searching for Best Practices in Retrieval-Augmented Generation (EMNLP 2024) ☆329 · Updated 7 months ago
- ☆529 · Updated 8 months ago
- Arena-Hard-Auto: An automatic LLM benchmark. ☆889 · Updated last month
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆378 · Updated 10 months ago
- FuseAI Project ☆579 · Updated 6 months ago
- Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models". ☆629 · Updated this week
- A collection of 150+ surveys on LLMs ☆321 · Updated 5 months ago
- Automatically evaluate your LLMs in Google Colab ☆649 · Updated last year
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆497 · Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆504 · Updated 6 months ago
- Awesome-LLM-Prompt-Optimization: a curated list of advanced prompt optimization and tuning methods in Large Language Models ☆363 · Updated last year