allenai / WildBenchLinks

Benchmarking LLMs with Challenging Tasks from Real Users

☆233

Alternatives and similar repositories for WildBench

Users that are interested in WildBench are comparing it to the libraries listed below

Sorting:

TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆146Updated 9 months ago
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆242Updated 8 months ago
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆235Updated 3 months ago
TIGER-AI-Lab / LongICLBench
Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]
☆105Updated 5 months ago
dwzhu-pku / LongEmbed
LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)
☆140Updated 8 months ago
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆207Updated last month
allenai / olmes
Reproducible, flexible LLM evaluations
☆226Updated 3 weeks ago
Re-Align / URIAL
☆311Updated last year
dwzhu-pku / PoSE
Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Length (ICLR 2024)
☆205Updated last year
da03 / implicit_chain_of_thought
☆135Updated 8 months ago
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆162Updated 2 months ago
ScalerLab / JudgeBench
☆91Updated 8 months ago
TIGER-AI-Lab / CritiqueFineTuning
Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025]
☆169Updated 3 weeks ago
openai / safety-rbr-code-and-data
Code and example data for the paper: Rule Based Rewards for Language Model Safety
☆190Updated last year
GAIR-NLP / AIME-Preview
☆71Updated 4 months ago
QwenLM / ProcessBench
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆166Updated 2 months ago
da03 / Internalize_CoT_Step_by_Step
☆187Updated 3 months ago
princeton-nlp / HELMET
The HELMET Benchmark
☆161Updated 3 months ago
SALT-NLP / demonstrated-feedback
☆125Updated 10 months ago
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆257Updated last year
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆127Updated last year
facebookresearch / Shepherd
This is the repo for the paper Shepherd -- A Critic for Language Model Generation
☆219Updated last year
QingruZhang / PASTA
PASTA: Post-hoc Attention Steering for LLMs
☆122Updated 8 months ago
TIGER-AI-Lab / General-Reasoner
General Reasoner: Advancing LLM Reasoning Across All Domains
☆156Updated last month
kyegomez / Lets-Verify-Step-by-Step
"Improving Mathematical Reasoning with Process Supervision" by OPENAI
☆112Updated 2 weeks ago
GAIR-NLP / scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
☆42Updated last year
zankner / CLoud
Critique-out-Loud Reward Models
☆70Updated 9 months ago
architsharma97 / dpo-rlaif
☆99Updated last year
princeton-nlp / AutoCompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
☆309Updated 10 months ago
alonj / Same-Task-More-Tokens
The code for the paper: "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models"
☆54Updated last year