felipemaiapolo / prompteval
Efficient multi-prompt evaluation of LLMs
☆20Updated 6 months ago
Alternatives and similar repositories for prompteval
Users who are interested in prompteval are comparing it to the libraries listed below.
- This is the repo for constructing a comprehensive and rigorous evaluation framework for LLM calibration.☆13Updated last year
- ☆39Updated 2 years ago
- [NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua…☆33Updated 2 years ago
- This repository contains data, code and models for contextual noncompliance.☆23Updated 11 months ago
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"☆106Updated 8 months ago
- Token-level Reference-free Hallucination Detection☆94Updated last year
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)☆47Updated 5 months ago
- The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models"☆108Updated last year
- In-context Example Selection with Influences☆15Updated 2 years ago
- ☆30Updated last year
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding.☆40Updated 3 months ago
- NLPBench: Evaluating NLP-Related Problem-solving Ability in Large Language Models☆10Updated last year
- Aioli: A unified optimization framework for language model data mixing☆27Updated 5 months ago
- ☆28Updated 4 months ago
- ☆29Updated 11 months ago
- AbstainQA, ACL 2024☆26Updated 8 months ago
- ☆18Updated 3 months ago
- ReBase: Training Task Experts through Retrieval Based Distillation☆29Updated 4 months ago
- Code/data for MARG (multi-agent review generation)☆44Updated 7 months ago
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors☆77Updated 6 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators☆42Updated last year
- ☆68Updated 10 months ago
- Data and code for the preprint "In-Context Learning with Long-Context Models: An In-Depth Exploration"☆37Updated 10 months ago
- The TABLET benchmark for evaluating instruction learning with LLMs for tabular prediction.☆21Updated 2 years ago
- Evaluate the Quality of Critique☆35Updated last year
- Code for our paper: "GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models"☆55Updated 2 years ago
- Conformal Language Modeling☆30Updated last year
- The official repo for DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph☆16Updated 8 months ago
- [ACL 2023]: Training Trajectories of Language Models Across Scales https://arxiv.org/pdf/2212.09803.pdf☆24Updated last year
- Codebase for the paper "The Remarkable Robustness of LLMs: Stages of Inference?"☆18Updated 2 weeks ago