WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆209 · Updated last week

Alternatives and similar repositories for ZeroEval
Users interested in ZeroEval are comparing it to the libraries listed below:
- Benchmarking LLMs with Challenging Tasks from Real Users ☆220 · Updated 5 months ago
- ☆166 · Updated this week
- The official evaluation suite and dynamic data release for MixEval. ☆235 · Updated 5 months ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆137 · Updated 2 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆325 · Updated this week
- The HELMET Benchmark ☆135 · Updated last week
- ☆114 · Updated 2 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆168 · Updated last month
- Code for NeurIPS'24 paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" ☆187 · Updated 4 months ago
- Reproducible, flexible LLM evaluations ☆191 · Updated 3 weeks ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Length (ICLR 2024) ☆205 · Updated 11 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper ☆181 · Updated this week
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆177 · Updated last week
- ☆96 · Updated 9 months ago
- ☆308 · Updated 10 months ago
- Evaluating LLMs with fewer examples ☆151 · Updated last year
- ☆282 · Updated last month
- SWE Arena ☆31 · Updated last week
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆187 · Updated 2 weeks ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore" ☆196 · Updated 2 weeks ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆190 · Updated last month
- Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆236 · Updated this week
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ☆186 · Updated 9 months ago
- Code and data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR 2025] ☆105 · Updated 2 months ago
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆185 · Updated 8 months ago
- "Improving Mathematical Reasoning with Process Supervision" by OpenAI ☆108 · Updated 2 weeks ago
- ☆70 · Updated 5 months ago
- RewardBench: the first evaluation tool for reward models ☆555 · Updated last month
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆133 · Updated 5 months ago
- EvaByte: Efficient Byte-level Language Models at Scale ☆87 · Updated last month