allenai / super-benchmark
☆41 · Updated last week
Alternatives and similar repositories for super-benchmark
Users who are interested in super-benchmark are comparing it to the libraries listed below:
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages ☆45 · Updated 4 months ago
- ☆21 · Updated 10 months ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆47 · Updated last year
- [ACL'24] Code and data for the paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆54 · Updated last year
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated last year
- Implementation of the paper "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆53 · Updated 4 months ago
- ☆22 · Updated 4 months ago
- Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model ☆42 · Updated last year
- Codebase for Context-aware Meta-learned Loss Scaling (CaMeLS). https://arxiv.org/abs/2305.15076. ☆25 · Updated last year
- Evaluate the Quality of Critique ☆34 · Updated 10 months ago
- ☆26 · Updated 9 months ago
- Exploration of automated dataset selection approaches at large scales. ☆37 · Updated last month
- Benchmarking Benchmark Leakage in Large Language Models ☆51 · Updated 10 months ago
- Training and Benchmarking LLMs for Code Preference. ☆33 · Updated 5 months ago
- [arXiv preprint] Official Repository for "Evaluating Language Models as Synthetic Data Generators" ☆34 · Updated 4 months ago
- ☆27 · Updated 3 weeks ago
- Replicating O1 inference-time scaling laws ☆83 · Updated 4 months ago
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions ☆42 · Updated 9 months ago
- ☆14 · Updated this week
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners ☆75 · Updated 2 weeks ago
- A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models ☆46 · Updated last month
- CodeUltraFeedback: aligning large language models to coding preferences ☆71 · Updated 9 months ago
- ☆27 · Updated last year
- [EMNLP Findings 2024 & ACL 2024 NLRSE Oral] Enhancing Mathematical Reasonin… ☆49 · Updated 11 months ago
- SILO Language Models code repository ☆81 · Updated last year
- InstructIR, a novel benchmark specifically designed to evaluate the instruction following ability in information retrieval models. Our foc… ☆31 · Updated 10 months ago
- ☆41 · Updated 8 months ago
- Minimal implementation of the paper "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" (arXiv:2401.01335) ☆29 · Updated last year
- Source code for our paper: "Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction A… ☆44 · Updated last year
- This repository contains data, code and models for contextual noncompliance. ☆21 · Updated 8 months ago