allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆221 Updated 6 months ago
Alternatives and similar repositories for WildBench:
Users who are interested in WildBench are comparing it to the repositories listed below.
- The official evaluation suite and dynamic data release for MixEval. ☆238 Updated 5 months ago
- ☆150 Updated 4 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆139 Updated 6 months ago
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆133 Updated 7 months ago
- Self-Alignment with Principle-Following Reward Models ☆160 Updated last year
- Reproducible, flexible LLM evaluations ☆197 Updated last month
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆189 Updated 5 months ago
- A simple unified framework for evaluating LLMs ☆209 Updated 3 weeks ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆186 Updated 9 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆177 Updated last month
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025] ☆105 Updated 2 months ago
- ☆120 Updated 7 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆192 Updated 2 weeks ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆134 Updated 5 months ago
- Evaluating LLMs with fewer examples ☆151 Updated last year
- ☆127 Updated 5 months ago
- ☆97 Updated 10 months ago
- Critique-out-Loud Reward Models ☆63 Updated 6 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆196 Updated last week
- ☆309 Updated 10 months ago
- ☆170 Updated 2 weeks ago
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆459 Updated last year
- RewardBench: the first evaluation tool for reward models. ☆562 Updated 2 months ago
- The HELMET Benchmark ☆142 Updated 2 weeks ago
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling ☆101 Updated 3 months ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆336 Updated 2 weeks ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆143 Updated 7 months ago
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆240 Updated last year
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆186 Updated 3 weeks ago
- ☆62 Updated last month