IBM / benchbench
A package dedicated to running benchmark agreement testing
☆13 Updated this week
Related projects
Alternatives and complementary repositories for benchbench
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆46 Updated 2 months ago
- Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model ☆41 Updated 10 months ago
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; arXiv preprint arXiv:2403.… ☆37 Updated 4 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆41 Updated 9 months ago
- Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs ☆43 Updated 4 months ago
- ☆38 Updated 7 months ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆48 Updated 7 months ago
- Code for Zero-Shot Tokenizer Transfer ☆115 Updated last month
- Code for PHATGOOSE introduced in "Learning to Route Among Specialized Experts for Zero-Shot Generalization" ☆78 Updated 8 months ago
- Minimum Bayes Risk Decoding for Hugging Face Transformers ☆56 Updated 5 months ago
- ☆112 Updated last month
- ☆36 Updated 5 months ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆74 Updated this week
- ☆126 Updated 7 months ago
- ☆25 Updated 11 months ago
- ☆46 Updated this week
- SILO Language Models code repository ☆80 Updated 8 months ago
- Code for the examples presented in the talk "Training a Llama in your backyard: fine-tuning very large models on consumer hardware" given… ☆14 Updated last year
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆44 Updated last year
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆63 Updated last year
- ☆47 Updated 9 months ago
- Codebase accompanying the Summary of a Haystack paper. ☆72 Updated 2 months ago
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions ☆40 Updated 4 months ago
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he… ☆31 Updated last year
- ☆43 Updated last month
- Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl… ☆58 Updated 3 months ago
- Evaluating LLMs with fewer examples ☆135 Updated 7 months ago
- Lightweight tool to identify Data Contamination in LLMs evaluation ☆42 Updated 8 months ago
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆19 Updated 8 months ago