allenai / fluid-benchmarking
Fluid Language Model Benchmarking
☆25 Updated 4 months ago
Alternatives and similar repositories for fluid-benchmarking
Users who are interested in fluid-benchmarking are comparing it to the repositories listed below.
- ☆45 Updated 7 months ago
- Repository for the paper Stream of Search: Learning to Search in Language ☆152 Updated 11 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆61 Updated last year
- ☆85 Updated this week
- Can Language Models Solve Olympiad Programming? ☆123 Updated last year
- ☆91 Updated last year
- [ICLR 2026] RPG: KL-Regularized Policy Gradient (https://arxiv.org/abs/2505.17508) ☆64 Updated this week
- ☆33 Updated last year
- Universal Reasoning Model ☆121 Updated 2 weeks ago
- ☆75 Updated last year
- Code and Configs for Asynchronous RLHF: Faster and More Efficient RL for Language Models ☆68 Updated 9 months ago
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… ☆127 Updated 3 months ago
- Official repo for Learning to Reason for Long-Form Story Generation ☆74 Updated 9 months ago
- PostTrainBench measures how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours ☆118 Updated last week
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models ☆40 Updated last year
- Official repository for "BLEUBERI: BLEU is a surprisingly effective reward for instruction following" ☆31 Updated 7 months ago
- ☆123 Updated 11 months ago
- Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl… ☆78 Updated last year
- ☆27 Updated 4 months ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆94 Updated last year
- Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" ☆88 Updated last year
- Functional Benchmarks and the Reasoning Gap ☆89 Updated last year
- ☆152 Updated 4 months ago
- Replicating O1 inference-time scaling laws ☆91 Updated last year
- Evaluation of LLMs on latest math competitions ☆213 Updated last month
- Language models scale reliably with over-training and on downstream tasks ☆99 Updated last year
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆42 Updated last month
- A repository for research on medium sized language models. ☆77 Updated last year
- Official Repo for InSTA: Towards Internet-Scale Training For Agents ☆55 Updated 6 months ago
- [ICLR 2026] Official PyTorch Implementation of RLP: Reinforcement as a Pretraining Objective ☆226 Updated this week