allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆206 · Updated 2 months ago
Alternatives and similar repositories for WildBench:
Users who are interested in WildBench are comparing it to the repositories listed below.
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆129 · Updated 2 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆126 · Updated 2 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆233 · Updated 2 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆97 · Updated 6 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆134 · Updated last month
- A simple unified framework for evaluating LLMs ☆164 · Updated 3 weeks ago
- ☆135 · Updated 3 months ago
- Self-Alignment with Principle-Following Reward Models ☆150 · Updated 10 months ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆174 · Updated 5 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Length (ICLR 2024) ☆204 · Updated 7 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆164 · Updated 2 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆145 · Updated last month
- Reformatted Alignment ☆113 · Updated 3 months ago
- Reproducible, flexible LLM evaluations ☆118 · Updated last month
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆221 · Updated last year
- ☆303 · Updated 7 months ago
- Evaluating LLMs with fewer examples ☆141 · Updated 9 months ago
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆154 · Updated 3 months ago
- ☆119 · Updated last month
- ☆120 · Updated 7 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆107 · Updated 2 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆120 · Updated 6 months ago
- Generative Judge for Evaluating Alignment ☆223 · Updated last year
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆139 · Updated 3 months ago
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆170 · Updated 5 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆153 · Updated last month
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆120 · Updated last month
- ☆137 · Updated 9 months ago
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location. ☆77 · Updated 5 months ago
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆171 · Updated 3 months ago