d223302 / LLM-EvaluationLinks

Can Large Language Models Be an Alternative to Human Evaluations?

☆9

Alternatives and similar repositories for LLM-Evaluation

Users that are interested in LLM-Evaluation are comparing it to the libraries listed below

Sorting:

Betswish / Cross-Lingual-Consistency
Easy-to-use framework for evaluating cross-lingual consistency of factual knowledge (Supported LLaMA, BLOOM, mT5, RoBERTa, etc.) Paper he…
☆25Updated 5 months ago
tingofurro / summac
Codebase, data and models for the SummaC paper in TACL
☆98Updated 6 months ago
ryokamoi / wice
This repository contains the dataset and code for "WiCE: Real-World Entailment for Claims in Wikipedia" in EMNLP 2023.
☆41Updated last year
artidoro / frank
FRANK: Factuality Evaluation Benchmark
☆57Updated 2 years ago
Yale-LILY / ROSE
☆39Updated 2 years ago
awebson / prompt_semantics
This repository accompanies our paper “Do Prompt-Based Models Really Understand the Meaning of Their Prompts?”
☆85Updated 3 years ago
google-research-datasets / xsum_hallucination_annotations
Faithfulness and factuality annotations of XSum summaries from our paper "On Faithfulness and Factuality in Abstractive Summarization" (h…
☆84Updated 4 years ago
salesforce / QAFactEval
☆50Updated 2 years ago
realtimeqa / realtimeqa_public
☆76Updated last year
microsoft / HaDes
Token-level Reference-free Hallucination Detection
☆96Updated 2 years ago
LanD-FBK / benchmark-gen-explanations
Codes for "Benchmarking the Generation of Fact Checking Explanations"
☆10Updated 11 months ago
yanaiela / pararel
☆45Updated last year
nelson-liu / evaluating-verifiability-in-generative-search-engines
Companion repo for "Evaluating Verifiability in Generative Search Engines".
☆83Updated 2 years ago
Betswish / MIRAGE
Easy-to-use MIRAGE code for faithful answer attribution in RAG applications. Paper: https://aclanthology.org/2024.emnlp-main.347/
☆24Updated 5 months ago
GXimingLu / neurologic_decoding
☆82Updated 2 years ago
nyu-mll / crows-pairs
This repository contains the data and code introduced in the paper "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Maske…
☆122Updated last year
mcao516 / EntFA
☆27Updated 2 years ago
WadeYin9712 / GeoMLAMA
☆15Updated 2 years ago
McGill-NLP / bias-bench
ACL 2022: An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models.
☆143Updated 7 months ago
tylerachang / multilingual-geometry
The geometry of multilingual language model representations (EMNLP 2022).
☆21Updated 2 years ago
amazon-science / bold
Dataset associated with "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" paper
☆79Updated 4 years ago
AlexTMallen / adaptive-retrieval
☆184Updated last month
jzbjyb / lm-calibration
☆35Updated 3 years ago
violet-zct / fairseq-detect-hallucination
Detect hallucinated tokens for conditional sequence generation.
☆64Updated 3 years ago
hitz-zentroa / lm-contamination
The LM Contamination Index is a manually created database of contamination evidences for LMs.
☆78Updated last year
McGill-NLP / FaithDial
☆51Updated 2 years ago
beinborn / relative_importance
☆17Updated last month
google-research / true
Code and data accompanying the paper "TRUE: Re-evaluating Factual Consistency Evaluation".
☆81Updated 3 weeks ago
SeaEval / SeaEval
NAACL 2024: SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
☆25Updated 5 months ago
katherinethai / par3
☆28Updated 8 months ago