allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆195 · Updated 2 weeks ago
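Before browsing the related repositories below, it can help to peek at the benchmark itself. The sketch below is a minimal, unofficial example of loading the WildBench task set with Hugging Face `datasets`; the dataset id `allenai/WildBench`, the config name `v2`, and the split `test` are assumptions about the published Hub release, not the official evaluation harness from this repo.

```python
# Minimal sketch: inspect the WildBench task set via Hugging Face `datasets`.
# Assumptions (not from this page): the Hub dataset id "allenai/WildBench",
# the config name "v2", and the split "test" match the published release.
from datasets import load_dataset

tasks = load_dataset("allenai/WildBench", "v2", split="test")
print(len(tasks))  # number of benchmark tasks

# Each record holds a real-user conversation; field names may differ by release.
example = tasks[0]
for key in example:
    print(key)
```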
Related projects
Alternatives and complementary repositories for WildBench
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆91 · Updated 4 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆124 · Updated 3 weeks ago
- Self-Alignment with Principle-Following Reward Models ☆148 · Updated 8 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆115 · Updated last week
- The official evaluation suite and dynamic data release for MixEval. ☆224 · Updated last week
- Official repository for Inheritune. ☆105 · Updated last month
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆129 · Updated 2 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆92 · Updated last week
- A simple unified framework for evaluating LLMs ☆145 · Updated last week
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆117 · Updated 4 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper ☆110 · Updated 3 weeks ago
- Reformatted Alignment ☆112 · Updated last month
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆146 · Updated 3 weeks ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) ☆199 · Updated 6 months ago
- Official implementation for "Extending LLMs’ Context Window with 100 Samples" ☆74 · Updated 10 months ago
- Code accompanying "How I learned to start worrying about prompt formatting". ☆95 · Updated last month
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆213 · Updated last year
- Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?" ☆64 · Updated last week
- [ACL 2024] LooGLE: Long Context Evaluation for Long-Context Language Models ☆167 · Updated last month
- Repository for the paper "Shepherd: A Critic for Language Model Generation" ☆213 · Updated last year
- PASTA: Post-hoc Attention Steering for LLMs ☆108 · Updated 2 months ago
- Generative Judge for Evaluating Alignment ☆217 · Updated 10 months ago
- RewardBench: the first evaluation tool for reward models. ☆431 · Updated 3 weeks ago
- Homepage for ProLong (Princeton long-context language models) and the paper "How to Train Long-Context Language Models (Effectively)" ☆118 · Updated 3 weeks ago
- [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement ☆156 · Updated 7 months ago