IBM / benchbench
A package dedicated to running benchmark agreement testing
☆16Updated last month
Alternatives and similar repositories for benchbench
Users who are interested in benchbench compare it to the libraries listed below.
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks☆43Updated 6 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆95Updated 3 weeks ago
- Python package for serving a local search engine. One command to download and serve a datastore---that's it 😎.☆21Updated 3 weeks ago
- Scalable Meta-Evaluation of LLMs as Evaluators☆42Updated last year
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment☆57Updated 9 months ago
- Code for Zero-Shot Tokenizer Transfer☆133Updated 5 months ago
- ReBase: Training Task Experts through Retrieval Based Distillation☆29Updated 4 months ago
- Code and Data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering"☆84Updated 10 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers☆130Updated last year
- Repository for "Attribute First, then Generate: Locally-attributable Grounded Text Generation", ACL 2024☆29Updated 6 months ago
- Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model☆43Updated last year
- Official code repo for paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs"☆23Updated last month
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding.☆40Updated 3 months ago
- Codebase accompanying the Summary of a Haystack paper.☆78Updated 9 months ago
- List of papers on Self-Correction of LLMs.☆73Updated 6 months ago
- Data and code for the preprint "In-Context Learning with Long-Context Models: An In-Depth Exploration"☆37Updated 10 months ago
- A library for parameter-efficient and composable transfer learning for NLP with sparse fine-tunings.☆73Updated 10 months ago
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".☆78Updated last year
- datasets from the paper "Towards Understanding Sycophancy in Language Models"☆81Updated last year
- InstructIR, a novel benchmark specifically designed to evaluate the instruction-following ability of information retrieval models. Our foc…☆32Updated last year
- Minimum Bayes Risk Decoding for Hugging Face Transformers☆58Updated last year
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)☆47Updated 5 months ago