Repository for "Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators"
☆12Mar 25, 2025Updated 11 months ago
Alternatives and similar repositories for scaling-evaluation-compute
Users who are interested in scaling-evaluation-compute are comparing it to the repositories listed below:
- [ACL 2025 Main] Official Repository for "Evaluating Language Models as Synthetic Data Generators"☆41Dec 13, 2024Updated last year
- Official implementation for "MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models"☆18Oct 26, 2024Updated last year
- Official code and dataset for our NAACL 2024 paper: DialogCC: An Automated Pipeline for Creating High-Quality Multi-modal Dialogue Datase…☆13Jun 24, 2024Updated last year
- [NeurIPS 2025] "Reasoning Models Better Express Their Confidence"☆22Nov 19, 2025Updated 3 months ago
- Dataset and Evaluation Code for the K-QA Benchmark.☆18May 26, 2024Updated last year
- Official code and dataset for our EMNLP 2024 Findings paper: Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Kn…☆19Dec 27, 2024Updated last year
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages☆53Aug 10, 2025Updated 6 months ago
- Evaluating Multimodal Generative AI with Korean Educational Standards, NAACL 2025.☆25May 15, 2025Updated 9 months ago
- ☆24Dec 2, 2023Updated 2 years ago
- CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering☆23Feb 26, 2021Updated 5 years ago
- Implementation of the MATRIX framework (ICML 2024)☆60May 6, 2024Updated last year
- The most modern LLM evaluation toolkit☆70Nov 9, 2025Updated 3 months ago
- ☆37Jan 26, 2025Updated last year
- Evaluation of Korean language models using a self-built Korean evaluation dataset☆31May 31, 2024Updated last year
- ☆33Aug 30, 2023Updated 2 years ago
- Korean datasets available on Hugging Face☆36Oct 10, 2024Updated last year
- ☆13Jan 12, 2023Updated 3 years ago
- Pytorch Code and Data for EnvEdit: Environment Editing for Vision-and-Language Navigation (CVPR 2022)☆30Aug 2, 2022Updated 3 years ago
- ☆10Nov 7, 2022Updated 3 years ago
- ☆11May 18, 2022Updated 3 years ago
- Python package for Geometric / Clifford Algebra with Pytorch.☆14Jan 25, 2026Updated last month
- Detect-Then-Explain Framework for Text-to-SQL task☆10Dec 6, 2023Updated 2 years ago
- Repo for our work "Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence"☆19Jun 2, 2025Updated 9 months ago
- PyTorch Implementation for the paper "Let Me Help You! Neuro-Symbolic Short-Context Action Anticipation" accepted to RA-L'24.☆12Nov 27, 2024Updated last year
- ☆13Nov 15, 2017Updated 8 years ago
- A simple repository showcasing a few LLM evaluation strategies and leveraging W&B Sweeps to optimize the LLM system.☆12Jul 11, 2023Updated 2 years ago
- ☆10Oct 11, 2022Updated 3 years ago
- PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions (NeurIPS 2025 D&B track, Spotlight)☆23Feb 11, 2026Updated 2 weeks ago
- [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision☆96Oct 30, 2024Updated last year
- [NAACL 2024] Vision language model that reduces hallucinations through self-feedback guided revision. Visualizes attentions on image feat…☆47Aug 21, 2024Updated last year
- [CVPR 2023] Official PyTorch Implementation for "Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust N…☆45Jul 18, 2023Updated 2 years ago
- NeurIPS 2024 tutorial on LLM Inference☆49Dec 10, 2024Updated last year
- ☆47Apr 9, 2025Updated 10 months ago
- This repo is for Korean wiki table question answering datasets described in the paper of Korean-Specific Dataset for Table Question Answe…☆91Oct 22, 2024Updated last year
- Lipschitz Lifelong RL☆11Nov 6, 2020Updated 5 years ago
- HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models☆13Mar 6, 2025Updated 11 months ago
- A brief tutorial for eBPF: Verifier, observability, networking, and security.☆12Sep 19, 2024Updated last year
- DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery☆20Sep 24, 2025Updated 5 months ago
- ☆12Jun 16, 2023Updated 2 years ago