A collection of benchmarks and datasets for evaluating LLMs.
☆569 · Jul 13, 2024 · Updated last year
Alternatives and similar repositories for llm_benchmarks
Users that are interested in llm_benchmarks are comparing it to the libraries listed below.
- Measuring Massive Multitask Language Understanding | ICLR 2021 ☆1,579 · May 28, 2023 · Updated 2 years ago
- ☆13 · May 21, 2024 · Updated 2 years ago
- A framework for few-shot evaluation of language models. ☆12,678 · May 11, 2026 · Updated 2 weeks ago
- Multi-dimensional analysis of orthogonal safety directions in LLM alignment ☆22 · Mar 20, 2025 · Updated last year
- Multi-scale Anomaly Detection on Attributed Networks ☆14 · Dec 3, 2021 · Updated 4 years ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆202 · Dec 8, 2025 · Updated 5 months ago
- ☆15 · Mar 26, 2024 · Updated 2 years ago
- Exploration of automated dataset selection approaches at large scales. ☆54 · Mar 4, 2025 · Updated last year
- ☆26 · May 28, 2025 · Updated 11 months ago
- AllenAI's post-training codebase ☆3,729 · Updated this week
- Mutual Information Predicts Hallucinations in Abstractive Summarization ☆13 · Nov 14, 2022 · Updated 3 years ago
- To assess the longtext capabilities more comprehensively, we propose Needle-in-a-Haystack PLUS, which shifts the focus from simple fact r… ☆13 · Mar 4, 2024 · Updated 2 years ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆2,289 · Aug 17, 2024 · Updated last year
- ☆19 · Aug 19, 2025 · Updated 9 months ago
- Awesome-LLM-Benchmark: List of benchmarks for Large-Language Models ☆11 · Apr 17, 2026 · Updated last month
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast. ☆1,985 · Aug 9, 2025 · Updated 9 months ago
- Sparse Autoencoder for Mechanistic Interpretability ☆297 · Jul 20, 2024 · Updated last year
- [SIGIR 2024] TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned Decision ☆20 · Mar 28, 2024 · Updated 2 years ago
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,773 · Aug 4, 2024 · Updated last year
- Paper list and datasets for the paper: A Survey on Data Selection for LLM Instruction Tuning ☆48 · Jan 22, 2026 · Updated 4 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆679 · Jul 29, 2025 · Updated 9 months ago
- This repository contains hybrid-rag a LLMOPS python package ☆26 · Feb 1, 2025 · Updated last year
- Official implementation of Panacea: A foundation model for clinical trial design, recruitment, search, and summarization. ☆19 · Dec 24, 2024 · Updated last year
- Fantastic Data Engineering for Large Language Models ☆92 · Dec 29, 2024 · Updated last year
- The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models". ☆1,600 · Apr 17, 2026 · Updated last month
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, … ☆7,025 · Updated this week
- PyTorch native post-training library ☆5,757 · Updated this week
- Summarize existing representative LLMs text datasets. ☆1,465 · Mar 11, 2026 · Updated 2 months ago
- ☆4,492 · Apr 22, 2026 · Updated last month
- The LLM Evaluation Framework ☆15,681 · Updated this week
- ☆60 · Nov 19, 2024 · Updated last year
- ☆16 · Apr 2, 2025 · Updated last year
- Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024) ☆71,468 · Updated this week
- ☆14 · Jun 12, 2023 · Updated 2 years ago
- Curated list of datasets and tools for post-training. ☆4,585 · Apr 29, 2026 · Updated 3 weeks ago
- ☆37 · Nov 14, 2025 · Updated 6 months ago
- Entropy Based Sampling and Parallel CoT Decoding ☆17 · Oct 9, 2024 · Updated last year
- An Educational Framework Based on PyTorch for Deep Learning Education and Exploration ☆11 · Dec 24, 2023 · Updated 2 years ago
- 🔬 Interpretability for Leela Chess Zero networks. ☆19 · Apr 29, 2026 · Updated 3 weeks ago