felipemaiapolo / prompteval

Efficient multi-prompt evaluation of LLMs

☆22

Alternatives and similar repositories for prompteval

Users who are interested in prompteval are comparing it to the repositories listed below.

allenai / discoverybench
Discovering Data-driven Hypotheses in the Wild
☆104 Updated 2 months ago
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160 Updated last year
neelsjain / BYOD
The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models"
☆107 Updated last year
jongjyh / TrFr
Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning
☆46 Updated last year
mlcommons / modelbench
Run safety benchmarks against AI models and view detailed reports showing how well they performed.
☆100 Updated this week
tianyang-x / SaySelf
Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"
☆109 Updated 10 months ago
probabilistic-inference-scaling / probabilistic-inference-scaling
☆51 Updated 4 months ago
lifan-yuan / OOD_NLP
[NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua…
☆34 Updated 2 years ago
Tiiiger / benchmark_llm_summarization
☆40 Updated 2 years ago
HazyResearch / aioli
Aioli: A unified optimization framework for language model data mixing
☆27 Updated 6 months ago
snap-stanford / optimas
Optimize Any User-defined Compound AI Systems
☆27 Updated 2 weeks ago
yueyu1030 / AttrPrompt
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
☆153 Updated last year
msakarvadia / AttentionLens
Interpreting the latent space representations of attention head outputs for LLMs
☆34 Updated last year
viswavi / few-shot-clustering
☆78 Updated 10 months ago
UW-Madison-Lee-Lab / LanguageInterfacedFineTuning
Code for Language-Interfaced FineTuning for Non-Language Machine Learning Tasks.
☆129 Updated 9 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
zjunlp / KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers
☆153 Updated 5 months ago
primeqa / clapnq
☆41 Updated 6 months ago
cambridgeltl / PairS
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)
☆48 Updated 6 months ago
OSU-NLP-Group / ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
☆97 Updated 2 months ago
patronus-ai / Lynx-hallucination-detection
☆41 Updated last year
SparkJiao / StructTest
☆19 Updated 3 weeks ago
XiangLi1999 / AutoBencher
☆29 Updated last year
princeton-nlp / LitSearch
[EMNLP 2024] A Retrieval Benchmark for Scientific Literature Search
☆93 Updated 8 months ago
ltgoslo / bert-in-context
Official implementation of "BERTs are Generative In-Context Learners"
☆32 Updated 5 months ago
allenai / SciRIFF
Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding.
☆40 Updated 4 months ago
declare-lab / trust-align
Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref…
☆63 Updated 5 months ago
KaiNylund / lm-weights-encode-time
☆69 Updated 11 months ago
EagleW / Scientific-Inspiration-Machines-Optimized-for-Novelty
Official implementation of the ACL 2024: Scientific Inspiration Machines Optimized for Novelty
☆84 Updated last year
allenai / marg-reviewer
Code/data for MARG (multi-agent review generation)
☆48 Updated 9 months ago