leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
☆445 Updated 10 months ago
Alternatives and similar repositories for llm_benchmarks
Users who are interested in llm_benchmarks are comparing it to the repositories listed below
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,516 Updated last week
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆699 Updated last month
- A reading list on LLM based Synthetic Data Generation 🔥 ☆1,265 Updated 2 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆243 Updated 2 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆353 Updated 8 months ago
- ☆691 Updated 2 weeks ago
- List of papers on hallucination detection in LLMs. ☆862 Updated last week
- Automatic evals for LLMs ☆388 Updated this week
- FuseAI Project ☆566 Updated 3 months ago
- Evaluate your LLM's response with Prometheus and GPT4 💯 ☆938 Updated 3 weeks ago
- Chat Templates for 🤗 HuggingFace Large Language Models ☆656 Updated 5 months ago
- Automatically evaluate your LLMs in Google Colab ☆622 Updated last year
- [ICML 2024] TrustLLM: Trustworthiness in Large Language Models ☆561 Updated 2 months ago
- Large Reasoning Models ☆805 Updated 5 months ago
- This is an implementation of the paper: Searching for Best Practices in Retrieval-Augmented Generation (EMNLP2024) ☆316 Updated 4 months ago
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,740 Updated 4 months ago
- ☆515 Updated 5 months ago
- ☆322 Updated last week
- RewardBench: the first evaluation tool for reward models. ☆566 Updated last week
- awesome synthetic (text) datasets ☆281 Updated 6 months ago
- ☆931 Updated 3 months ago
- A curated list of papers related to constrained decoding of LLMs, along with their relevant code and resources. ☆198 Updated 2 weeks ago
- Summarize existing representative LLM text datasets. ☆1,266 Updated last month
- Recipes to scale inference-time compute of open models ☆1,071 Updated last week
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆343 Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ☆828 Updated 9 months ago
- LIMO: Less is More for Reasoning ☆940 Updated last month
- A series of technical reports on Slow Thinking with LLM ☆667 Updated last month
- Generative Representational Instruction Tuning ☆628 Updated 2 months ago
- Official repository for ORPO ☆452 Updated 11 months ago