allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆182 · Updated last month
Related projects:
- Self-Alignment with Principle-Following Reward Models ☆144 · Updated 6 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆87 · Updated 2 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" ☆106 · Updated this week
- Official implementation for the paper "LongEmbed: Extending Embedding Models for Long Context Retrieval" ☆108 · Updated 4 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆200 · Updated this week
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Lengths (ICLR 2024) ☆195 · Updated 3 months ago
- A simple unified framework for evaluating LLMs ☆121 · Updated this week
- ☆105 · Updated this week
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly" ☆121 · Updated 3 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆123 · Updated 6 months ago
- Expert Specialized Fine-Tuning ☆129 · Updated last month
- ☆284 · Updated 3 months ago
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ☆131 · Updated 2 months ago
- Code for "In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering" ☆130 · Updated 2 months ago
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs ☆290 · Updated 5 months ago
- A pipeline to improve skills of large language models ☆149 · Updated this week
- Evaluating LLMs with fewer examples ☆131 · Updated 5 months ago
- Reformatted Alignment ☆111 · Updated 4 months ago
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆135 · Updated last month
- ☆77 · Updated 3 weeks ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆127 · Updated 2 weeks ago
- BABILong: a benchmark for LLM evaluation using the needle-in-a-haystack approach ☆139 · Updated 3 weeks ago
- LooGLE: Long Context Evaluation for Long-Context Language Models (ACL 2024) ☆148 · Updated 6 months ago
- Original implementation of "Detecting Pretraining Data from Large Language Models" by *Weijia Shi, *Anirudh Aji… ☆198 · Updated 10 months ago
- Evaluating Large Language Models at Evaluating Instruction Following (ICLR 2024) ☆104 · Updated 2 months ago
- ☆118 · Updated 5 months ago
- ☆87 · Updated 3 months ago
- RewardBench: the first evaluation tool for reward models ☆352 · Updated last week
- Official repository for Inheritune ☆89 · Updated 4 months ago
- ☆239 · Updated 10 months ago