potsawee / selfcheckgpt
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
⭐ 534 · Updated 11 months ago
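SelfCheckGPT's core idea is that details the model actually knows tend to reappear across several stochastically sampled responses, while hallucinated details do not. The sketch below is a minimal, self-contained illustration of that sample-and-compare step using a smoothed unigram model over the sampled passages (loosely in the spirit of the repo's n-gram variant); the function names are mine and this is not the package's API.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; deliberately simple for illustration."""
    return re.findall(r"[a-z0-9']+", text.lower())

def sentence_inconsistency_scores(
    response_sentences: list[str],
    sampled_passages: list[str],
    alpha: float = 1.0,
) -> list[float]:
    """Score each sentence of the main response by how surprising its tokens
    are under an add-alpha-smoothed unigram model of the sampled passages.
    Higher score = less support from the samples = more likely hallucinated."""
    counts: Counter[str] = Counter()
    for passage in sampled_passages:
        counts.update(tokenize(passage))
    total = sum(counts.values())
    vocab = len(counts) + 1  # extra slot for unseen tokens

    scores = []
    for sentence in response_sentences:
        tokens = tokenize(sentence)
        if not tokens:
            scores.append(0.0)
            continue
        neg_logprob = sum(
            -math.log((counts[tok] + alpha) / (total + alpha * vocab))
            for tok in tokens
        )
        scores.append(neg_logprob / len(tokens))  # length-normalised
    return scores

# The second sentence contains a claim absent from every sampled passage,
# so it receives the higher (more suspicious) score.
main_response = [
    "Paris is the capital of France.",
    "It was founded by Julius Caesar in 52 BC.",
]
samples = [
    "Paris is the capital and largest city of France.",
    "The capital of France is Paris, located on the Seine.",
    "France's capital city is Paris.",
]
print(sentence_inconsistency_scores(main_response, samples))
```

The repo's stronger scoring variants (BERTScore-, QA-, NLI-, and LLM-prompt-based) replace the unigram model, but the sample-and-compare structure is the same.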
Alternatives and similar repositories for selfcheckgpt
Users interested in selfcheckgpt are comparing it to the libraries listed below.
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ⭐ 479 · Updated last year
- Evaluate your LLM's response with Prometheus and GPT4 💯 ⭐ 952 · Updated last month
- List of papers on hallucination detection in LLMs. ⭐ 896 · Updated last week
- GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models" ⭐ 185 · Updated 6 months ago
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ⭐ 498 · Updated 5 months ago
- Codebase for reproducing the experiments of the semantic uncertainty paper (short-phrase and sentence-length experiments). ⭐ 327 · Updated last year
- [EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627 ⭐ 490 · Updated 8 months ago
- [ICLR 2024 & NeurIPS 2023 WS] An Evaluator LM that is open-source, offers reproducible evaluation, and inexpensive to use. Specifically d… ⭐ 299 · Updated last year
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model ⭐ 530 · Updated 4 months ago
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ⭐ 347 · Updated last year
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs ⭐ 254 · Updated last year
- A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic… ⭐ 353 · Updated 2 months ago
- RefChecker provides an automatic checking pipeline and a benchmark dataset for detecting fine-grained hallucinations generated by Large Langua… ⭐ 373 · Updated last month
- Code for the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" ⭐ 351 · Updated last year
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels… ⭐ 268 · Updated last year
- RankLLM is a Python toolkit for reproducible information retrieval research using rerankers, with a focus on listwise reranking. ⭐ 465 · Updated last week
- Generative Representational Instruction Tuning ⭐ 651 · Updated 3 months ago
- Automated Evaluation of RAG Systems ⭐ 609 · Updated 2 months ago
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively. ⭐ 703 · Updated 8 months ago
- ⭐ 283 · Updated last year
- Forward-Looking Active REtrieval-augmented generation (FLARE) ⭐ 636 · Updated last year
- Data and Code for Program of Thoughts (TMLR 2023) ⭐ 276 · Updated last year
- Representation Engineering: A Top-Down Approach to AI Transparency ⭐ 836 · Updated 10 months ago
- Repository for "MultiHop-RAG: A Dataset for Evaluating Retrieval-Augmented Generation Across Documents" (COLM 2024) ⭐ 326 · Updated 2 months ago
- A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval. ⭐ 367 · Updated last year
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ⭐ 546 · Updated last year
- Official repository for ORPO ⭐ 455 · Updated last year
- Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214) ⭐ 363 · Updated last week
- RewardBench: the first evaluation tool for reward models. ⭐ 604 · Updated last week
- [ACL 2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the dive… ⭐ 946 · Updated 8 months ago