leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
⭐ 511 · Updated last year
Alternatives and similar repositories for llm_benchmarks
Users interested in llm_benchmarks are comparing it to the repositories listed below.
- A reading list on LLM-based Synthetic Data Generation 🔥 · ⭐ 1,418 · Updated 3 months ago
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. · ⭐ 780 · Updated last year
- Chat Templates for 🤗 HuggingFace Large Language Models · ⭐ 701 · Updated 9 months ago
- List of papers on hallucination detection in LLMs. · ⭐ 960 · Updated 3 months ago
- Automatic evals for LLMs · ⭐ 528 · Updated 3 months ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … · ⭐ 772 · Updated 6 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] · ⭐ 292 · Updated 7 months ago
- A collection of 150+ surveys on LLMs · ⭐ 331 · Updated 7 months ago
- An implementation of the paper "Searching for Best Practices in Retrieval-Augmented Generation" (EMNLP 2024) · ⭐ 334 · Updated 9 months ago
- A curated list of retrieval-augmented generation (RAG) resources in large language models · ⭐ 313 · Updated 7 months ago
- Evaluate your LLM's response with Prometheus and GPT4 💯 · ⭐ 989 · Updated 5 months ago
- A collection of awesome prompt datasets and instruction datasets for training ChatLLMs such as ChatGPT; gathers all kinds of instruction datasets for training ChatLLM models. · ⭐ 700 · Updated last year
- Arena-Hard-Auto: an automatic LLM benchmark. · ⭐ 928 · Updated 3 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning · ⭐ 362 · Updated last year
- Awesome-LLM-Prompt-Optimization: a curated list of advanced prompt optimization and tuning methods for Large Language Models · ⭐ 370 · Updated last year
- Automatically evaluate your LLMs in Google Colab · ⭐ 660 · Updated last year
- Official repository for ORPO · ⭐ 464 · Updated last year
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models · ⭐ 566 · Updated last year
- A curated list of papers on constrained decoding of LLMs, along with relevant code and resources. · ⭐ 258 · Updated last month
- RewardBench: the first evaluation tool for reward models. · ⭐ 638 · Updated 3 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends · ⭐ 1,962 · Updated this week
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" · ⭐ 665 · Updated 2 months ago
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs · ⭐ 289 · Updated last year
- Compress your input to ChatGPT or other LLMs to let them process 2x more content and save 40% memory and GPU time. · ⭐ 397 · Updated last year
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models · ⭐ 595 · Updated 3 months ago
- Survey of Small Language Models from Penn State, ... · ⭐ 200 · Updated last month