GAIR-NLP / scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
☆42 Updated last year
Alternatives and similar repositories for scaleeval:
Users who are interested in scaleeval are comparing it to the libraries listed below.
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆137 Updated 5 months ago
- ☆56 Updated last month
- Critique-out-Loud Reward Models ☆57 Updated 5 months ago
- ☆70 Updated 5 months ago
- Evaluate the Quality of Critique ☆34 Updated 10 months ago
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆54 Updated last year
- A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models ☆46 Updated last month
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆124 Updated 9 months ago
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location. ☆80 Updated 8 months ago
- ☆59 Updated 7 months ago
- Official repository for paper "Weak-to-Strong Extrapolation Expedites Alignment" ☆74 Updated 10 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025] ☆105 Updated last month
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages ☆45 Updated 4 months ago
- [arXiv preprint] Official Repository for "Evaluating Language Models as Synthetic Data Generators" ☆34 Updated 4 months ago
- ☆94 Updated 3 weeks ago
- [ICLR 2025] InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales ☆81 Updated 2 months ago
- [ICLR'24 spotlight] Tool-Augmented Reward Modeling ☆47 Updated 3 months ago
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners ☆75 Updated 2 weeks ago
- ☆69 Updated last year
- ☆45 Updated last month
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆99 Updated last month
- A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs. ☆85 Updated last year
- Source code of "Reasons to Reject? Aligning Language Models with Judgments" ☆58 Updated last year
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering ☆57 Updated 4 months ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆47 Updated last year
- ☆96 Updated 9 months ago
- Reformatted Alignment ☆115 Updated 6 months ago
- LongHeads: Multi-Head Attention is Secretly a Long Context Processor ☆29 Updated last year
- Code implementation of synthetic continued pretraining ☆99 Updated 3 months ago
- GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems. ☆59 Updated 9 months ago