RUCAIBox / HaluEvalLinks

This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models.

☆517

Alternatives and similar repositories for HaluEval

Users that are interested in HaluEval are comparing it to the libraries listed below

Sorting:

shmsw25 / FActScore
A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic…
☆394Updated 6 months ago
voidism / DoLa
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
☆520Updated 9 months ago
likenneth / honest_llama
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
☆553Updated 8 months ago
nelson-liu / lost-in-the-middle
Code and data for "Lost in the Middle: How Language Models Use Long Contexts"
☆360Updated last year
princeton-nlp / ALCE
[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627
☆497Updated last year
teacherpeterpan / self-correction-llm-papers
This is a collection of research papers for Self-Correcting Large Language Models with Automated Feedback.
☆552Updated 11 months ago
potsawee / selfcheckgpt
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
☆571Updated last year
suzgunmirac / BIG-Bench-Hard
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
☆519Updated last year
glgh / awesome-llm-human-preference-datasets
A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.
☆380Updated 2 years ago
AI21Labs / in-context-ralm
☆291Updated last year
wangcunxiang / LLM-Factuality-Survey
The repository for the survey paper <<Survey on Large Language Models Factuality: Knowledge, Retrieval and Domain-Specificity>>
☆339Updated last year
sylinrl / TruthfulQA
TruthfulQA: Measuring How Models Imitate Human Falsehoods
☆824Updated 9 months ago
jinlanfu / GPTScore
Source Code of Paper "GPTScore: Evaluate as You Desire"
☆257Updated 2 years ago
kojima-takeshi188 / zero_shot_cot
Prod Env
☆433Updated 2 years ago
princeton-nlp / LESS
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆496Updated last year
LuckyyySTA / Awesome-LLM-hallucination
LLM hallucination paper list
☆323Updated last year
GAIR-NLP / auto-j
Generative Judge for Evaluating Alignment
☆247Updated last year
HITsz-TMG / awesome-llm-attributions
A Survey of Attributions for Large Language Models
☆216Updated last year
ParticleMedia / RAGTruth
Github repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models"
☆205Updated 10 months ago
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆250Updated 5 months ago
Shark-NLP / OpenICL
OpenICL is an open-source framework to facilitate research, development, and prototyping of in-context learning.
☆575Updated 2 years ago
allenai / reward-bench
RewardBench: the first evaluation tool for reward models.
☆643Updated 4 months ago
TIGER-AI-Lab / Program-of-Thoughts
Data and Code for Program of Thoughts [TMLR 2023]
☆289Updated last year
princeton-nlp / AutoCompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
☆314Updated last year
AlexTMallen / adaptive-retrieval
☆189Updated 3 months ago
llm-as-a-judge / Awesome-LLM-as-a-judge
☆447Updated 2 months ago
night-chen / ToolQA
ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels …
☆279Updated 2 years ago
Libr-AI / do-not-answer
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
☆294Updated last year
GaryYufei / AlignLLMHumanSurvey
Aligning Large Language Models with Human: A Survey
☆735Updated 2 years ago
nyu-mll / BBQ
Repository for the Bias Benchmark for QA dataset.
☆129Updated last year