leobeeson / llm_benchmarks
A collection of benchmarks and datasets for evaluating LLMs.
☆415 Updated 8 months ago
Alternatives and similar repositories for llm_benchmarks:
Users who are interested in llm_benchmarks are comparing it to the repositories listed below:
- The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey. ☆743 Updated 10 months ago
- LongBench v2 and LongBench (ACL 2024) ☆819 Updated 2 months ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆660 Updated last week
- A curated list of retrieval-augmented generation (RAG) in large language models ☆249 Updated last month
- ☆277 Updated 3 weeks ago
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆454 Updated last year
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆206 Updated last month
- ☆559 Updated 2 weeks ago
- Automatic evals for LLMs ☆340 Updated last week
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆313 Updated 6 months ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,336 Updated this week
- A collection of 150+ surveys on LLMs ☆270 Updated last month
- A reading list on LLM based Synthetic Data Generation 🔥 ☆1,221 Updated last month
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆349 Updated 6 months ago
- RewardBench: the first evaluation tool for reward models. ☆532 Updated last month
- This is the repository for the Tool Learning survey. ☆332 Updated 3 weeks ago
- ☆910 Updated 2 months ago
- This is an implementation of the paper: Searching for Best Practices in Retrieval-Augmented Generation (EMNLP2024) ☆300 Updated 3 months ago
- ☆316 Updated 6 months ago
- [EMNLP 2024: Demo Oral] RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation ☆293 Updated 5 months ago
- A series of technical reports on Slow Thinking with LLMs ☆595 Updated last week
- MAD: The first work to explore Multi-Agent Debate with Large Language Models :D ☆347 Updated 2 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆652 Updated 2 months ago
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆334 Updated last year
- A curated collection of LLM reasoning and planning resources, including key papers, limitations, benchmarks, and additional learning mate… ☆249 Updated last month
- Generative Representational Instruction Tuning ☆612 Updated 2 weeks ago
- Chat Templates for 🤗 HuggingFace Large Language Models ☆635 Updated 3 months ago
- [NeurIPS 2024 Spotlight] Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models ☆613 Updated this week
- Official repository for ORPO ☆446 Updated 9 months ago
- Official repo for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs". ☆227 Updated 7 months ago