A package dedicated to running benchmark agreement testing
☆17 · Sep 18, 2025 · Updated 6 months ago
Alternatives and similar repositories for benchbench
Users who are interested in benchbench are comparing it to the libraries listed below
- 🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data … ☆211 · Feb 16, 2026 · Updated last month
- ♠️ TrucoBench: Which LLM is the best at truco? Results, analyses, and strategic insights. ☆19 · Feb 24, 2025 · Updated last year
- Esolang inspired by The Demon Girl Next Door (まちカドまぞく) ☆12 · Apr 17, 2025 · Updated 11 months ago
- codebase release for EMNLP2023 paper publication ☆19 · Sep 18, 2025 · Updated 6 months ago
- A simple model for predicting soccer outcomes ☆11 · Jul 12, 2024 · Updated last year
- ☆13 · Jul 13, 2025 · Updated 8 months ago
- Find informative examples to efficiently (human)-evaluate NLG models. ☆18 · Feb 27, 2026 · Updated 3 weeks ago
- Code and data for "A fine-grained comparison of pragmatic language understanding in humans and language models" ☆11 · Dec 14, 2022 · Updated 3 years ago
- Python binding for Jagger (C++ implementation of Pattern-based Japanese Morphological Analyzer) ☆12 · Dec 16, 2025 · Updated 3 months ago
- [EMNLP 2023] Official repository for Dialogue Chain-of-Thought Distillation (DONUT & DOCTOR) ☆11 · Nov 15, 2023 · Updated 2 years ago
- Repo for SPOLIN corpus and paper "Grounding Conversations with Improvised Dialogues" (ACL2020) ☆14 · Feb 20, 2026 · Updated last month
- EMNLP 2024 Tutorial: https://sites.google.com/view/reasoning-with-explanations ☆14 · Apr 15, 2025 · Updated 11 months ago
- Understanding attention for text classification ☆16 · Nov 27, 2020 · Updated 5 years ago
- Fluid Language Model Benchmarking ☆27 · Sep 16, 2025 · Updated 6 months ago
- DIRECT: Direct and Indirect REsponses in Conversational Text Corpus ☆17 · Jul 1, 2021 · Updated 4 years ago
- ☆15 · Dec 22, 2021 · Updated 4 years ago
- ☆29 · Dec 28, 2025 · Updated 2 months ago
- Official implementation of "OffsetBias: Leveraging Debiased Data for Tuning Evaluators" ☆26 · Sep 11, 2024 · Updated last year
- ☆19 · Jul 18, 2024 · Updated last year
- bootstrap my zsh shell ☆17 · Mar 10, 2026 · Updated last week
- [NeurIPS'24 Spotlight] Observational Scaling Laws ☆60 · Oct 2, 2024 · Updated last year
- [COLING 2022]: CommunityLM: Probing Partisan Worldviews from Language Models ☆15 · Jan 31, 2023 · Updated 3 years ago
- This project aims to convert the content of GitHub repositories into a structured, machine-readable format, enabling AI models like ChatG… ☆12 · May 13, 2024 · Updated last year
- LLM evaluation. ☆16 · Nov 7, 2023 · Updated 2 years ago
- ☆13 · Dec 29, 2023 · Updated 2 years ago
- pialign - A Phrasal ITG Aligner ☆24 · Apr 29, 2019 · Updated 6 years ago
- Circa (meaning ‘approximately’) dataset aims to help machine learning systems to solve the problem of interpreting indirect answers to po… ☆20 · Oct 8, 2020 · Updated 5 years ago
- Dataset shift diagnostics in Python ☆33 · Sep 15, 2023 · Updated 2 years ago
- Code and Data for "GenAI Arena: An Open Evaluation Platform for Generative Models" [NeurIPS 2024] ☆35 · Sep 8, 2024 · Updated last year
- The QA datasets used for DrQA evaluation. ☆14 · Nov 30, 2018 · Updated 7 years ago
- Code for NeurIPS 2024 paper "Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs" ☆46 · Feb 20, 2025 · Updated last year
- ☆46 · Mar 20, 2023 · Updated 3 years ago
- [ECCV 2024] M3DBench introduces a comprehensive 3D instruction-following dataset with support for interleaved multi-modal prompts. ☆61 · Oct 1, 2024 · Updated last year
- ☆31 · Nov 23, 2022 · Updated 3 years ago
- The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud. ☆18 · Dec 28, 2023 · Updated 2 years ago
- ☆20 · May 30, 2025 · Updated 9 months ago
- ☆13 · Oct 5, 2025 · Updated 5 months ago
- This repository collects lecture slides, assignments (CAs), code notebooks, reports, and reference papers used in the "Deep Generative Mo… ☆17 · Feb 14, 2026 · Updated last month
- [ICLR 2026] Official PyTorch implementation for "ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding" ☆61 · Dec 26, 2025 · Updated 2 months ago