WildEval / ZeroEvalLinks

A simple unified framework for evaluating LLMs

☆235

Alternatives and similar repositories for ZeroEval

Users that are interested in ZeroEval are comparing it to the libraries listed below

Sorting:

allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆233Updated 9 months ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆226Updated 3 weeks ago
princeton-nlp / HELMET
The HELMET Benchmark
☆161Updated 3 months ago
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆242Updated 8 months ago
da03 / Internalize_CoT_Step_by_Step
☆187Updated 3 months ago
booydar / babilong
BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.
☆208Updated 2 months ago
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆207Updated last month
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆225Updated 2 weeks ago
SalesforceAIResearch / LaTRO
☆117Updated 5 months ago
sail-sg / oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
☆418Updated last week
TIGER-AI-Lab / CritiqueFineTuning
Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025]
☆169Updated 3 weeks ago
openai / safety-rbr-code-and-data
Code and example data for the paper: Rule Based Rewards for Language Model Safety
☆190Updated last year
ScalerLab / JudgeBench
☆91Updated 8 months ago
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆306Updated last year
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆175Updated 4 months ago
LeonGuertler / TextArena
A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
☆225Updated this week
dwzhu-pku / PoSE
Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Length (ICLR 2024)
☆205Updated last year
kyegomez / Lets-Verify-Step-by-Step
"Improving Mathematical Reasoning with Process Supervision" by OPENAI
☆112Updated 2 weeks ago
nightdessert / Retrieval_Head
open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality
☆205Updated last year
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160Updated last year
TIGER-AI-Lab / LongICLBench
Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]
☆105Updated 5 months ago
architsharma97 / dpo-rlaif
☆99Updated last year
Re-Align / URIAL
☆311Updated last year
da03 / implicit_chain_of_thought
☆135Updated 8 months ago
expz / quiet-star
Implementation of the Quiet-STAR paper (https://arxiv.org/pdf/2403.09629.pdf)
☆54Updated 11 months ago
zorazrw / awesome-tool-llm
☆237Updated 11 months ago
StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…
☆232Updated 2 months ago
ryoungj / ObsScaling
[NeurIPS'24 Spotlight] Observational Scaling Laws
☆56Updated 10 months ago
mlfoundations / evalchemy
Automatic evals for LLMs
☆496Updated last month
tianyi-lab / Reflection_Tuning
[ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
☆359Updated 10 months ago