leobeeson / llm_benchmarks

A collection of benchmarks and datasets for evaluating LLMs.

☆486

Alternatives and similar repositories for llm_benchmarks

Users who are interested in llm_benchmarks are comparing it to the repositories listed below.

chujiezheng / chat_templates
Chat Templates for 🤗 HuggingFace Large Language Models
☆690 Updated 7 months ago
mlfoundations / evalchemy
Automatic evals for LLMs
☆496 Updated last month
wasiahmad / Awesome-LLM-Synthetic-Data
A reading list on LLM based Synthetic Data Generation 🔥
☆1,379 Updated 2 months ago
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆1,793 Updated this week
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆744 Updated 4 months ago
tjunlp-lab / Awesome-LLMs-Evaluation-Papers
The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.
☆781 Updated last year
llm-as-a-judge / Awesome-LLM-as-a-judge
☆400 Updated 2 weeks ago
prometheus-eval / prometheus-eval
Evaluate your LLM's response with Prometheus and GPT4 💯
☆978 Updated 3 months ago
tianyi-lab / Reflection_Tuning
[ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
☆360 Updated 11 months ago
coree / awesome-rag
A curated list of retrieval-augmented generation (RAG) in large language models
☆295 Updated 5 months ago
EdinburghNLP / awesome-hallucination-detection
List of papers on hallucination detection in LLMs.
☆930 Updated last month
TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
☆265 Updated 5 months ago
Zhen-Tan-dmml / LLM4Annotation
☆594 Updated last week
HowieHwong / TrustLLM
[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
☆586 Updated last month
jianzhnie / awesome-instruction-datasets
A collection of awesome-prompt-datasets and awesome-instruction-datasets to train ChatLLMs such as ChatGPT. Gathers a wide variety of instruction datasets for training ChatLLM models.
☆688 Updated last year
potsawee / selfcheckgpt
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
☆549 Updated last year
ContextualAI / gritlm
Generative Representational Instruction Tuning
☆664 Updated last month
jlko / semantic_uncertainty
Codebase for reproducing the experiments of the semantic uncertainty paper (short-phrase and sentence-length experiments).
☆349 Updated last year
xfactlab / orpo
Official repository for ORPO
☆462 Updated last year
FudanDNN-NLP / RAG
This is an implementation of the paper: Searching for Best Practices in Retrieval-Augmented Generation (EMNLP2024)
☆329 Updated 7 months ago
huggingface / cosmopedia
☆529 Updated 8 months ago
lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆889 Updated last month
idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆378 Updated 10 months ago
fanqiwan / FuseAI
FuseAI Project
☆579 Updated 6 months ago
google-deepmind / long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
☆629 Updated this week
NiuTrans / ABigSurveyOfLLMs
A collection of 150+ surveys on LLMs
☆321 Updated 5 months ago
mlabonne / llm-autoeval
Automatically evaluate your LLMs in Google Colab
☆649 Updated last year
RUCAIBox / HaluEval
This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models.
☆497 Updated last year
voidism / DoLa
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
☆504 Updated 6 months ago
jxzhangjhu / Awesome-LLM-Prompt-Optimization
Awesome-LLM-Prompt-Optimization: a curated list of advanced prompt optimization and tuning methods in Large Language Models
☆363 Updated last year