Scalable Meta-Evaluation of LLMs as Evaluators
☆43Feb 15, 2024Updated 2 years ago
Alternatives and similar repositories for scaleeval
Users that are interested in scaleeval are comparing it to the libraries listed below
- ☆13Jul 14, 2024Updated last year
- BeHonest: Benchmarking Honesty in Large Language Models☆34Aug 15, 2024Updated last year
- Evaluate the Quality of Critique☆36Jun 1, 2024Updated last year
- ☆78May 22, 2024Updated last year
- [ACL 2024] Code for "MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation"☆42Jul 19, 2024Updated last year
- Safety-J: Evaluating Safety with Critique☆16Jul 28, 2024Updated last year
- ☆27Mar 27, 2025Updated 11 months ago
- ☆51Mar 2, 2024Updated last year
- Official implementation for 'Extending LLMs’ Context Window with 100 Samples'☆81Jan 18, 2024Updated 2 years ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI☆107Mar 6, 2025Updated 11 months ago
- ☆25May 16, 2024Updated last year
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy☆77Oct 9, 2025Updated 4 months ago
- Align, a general text alignment function☆15Dec 7, 2023Updated 2 years ago
- Multimodal RewardBench☆62Feb 21, 2025Updated last year
- ☆12Sep 23, 2024Updated last year
- Collections of RLxLM experiments using minimal codes☆14Feb 17, 2025Updated last year
- Echo Noise Channel for Exact Mutual Information Calculation☆17Jul 17, 2020Updated 5 years ago
- ☆58Sep 2, 2024Updated last year
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"☆17Feb 22, 2024Updated 2 years ago
- ☆47Jan 31, 2026Updated last month
- Generative Judge for Evaluating Alignment☆250Jan 18, 2024Updated 2 years ago
- Trending projects & awesome papers about data-centric llm studies.☆40May 20, 2025Updated 9 months ago
- Technical Report: Is ChatGPT a Good NLG Evaluator? A Preliminary Study☆43Mar 8, 2023Updated 2 years ago
- Code and data from the paper 'Human Feedback is not Gold Standard'☆20Updated this week
- LLM evaluation.☆16Nov 7, 2023Updated 2 years ago
- ☆17Dec 12, 2020Updated 5 years ago
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing"☆23Mar 18, 2025Updated 11 months ago
- Just a bunch of benchmark logs for different LLMs☆119Jul 28, 2024Updated last year
- Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme☆147Apr 9, 2025Updated 10 months ago
- [Findings of EMNLP'2024] Unified Active Retrieval for Retrieval Augmented Generation☆23Sep 30, 2024Updated last year
- Contains materials from the facilitation sessions conducted for the ML Bootcamp India (2022) organized by Google DevRel team.☆22Sep 26, 2022Updated 3 years ago
- Evaluation suite for LLMs☆379Jul 11, 2025Updated 7 months ago
- ☆75Sep 1, 2022Updated 3 years ago
- This repository contains the code for "Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP".☆89Aug 20, 2021Updated 4 years ago
- Benchmarking Benchmark Leakage in Large Language Models☆59May 20, 2024Updated last year
- EMNLP 2022: Finding Dataset Shortcuts with Grammar Induction https://arxiv.org/abs/2210.11560☆58Feb 28, 2025Updated last year
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset☆111May 22, 2025Updated 9 months ago
- [NeurIPS D&B 2024] Generative AI for Math: MathPile☆418Apr 4, 2025Updated 10 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔☆35Apr 17, 2025Updated 10 months ago