nlpyang / geval

Code for paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"

☆381

Alternatives and similar repositories for geval

Users who are interested in geval are comparing it to the repositories listed below.

potsawee / selfcheckgpt
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
☆567 Updated last year
ParticleMedia / RAGTruth
GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models"
☆203 Updated 10 months ago
RUCAIBox / HaluEval
This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models.
☆515 Updated last year
jinlanfu / GPTScore
Source Code of Paper "GPTScore: Evaluate as You Desire"
☆257 Updated 2 years ago
prometheus-eval / prometheus
[ICLR 2024 & NeurIPS 2023 WS] An Evaluator LM that is open-source, offers reproducible evaluation, and is inexpensive to use. Specifically d…
☆305 Updated last year
princeton-nlp / ALCE
[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627
☆498 Updated last year
shmsw25 / FActScore
A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic…
☆386 Updated 6 months ago
asahi417 / lm-question-generation
Multilingual/multidomain question generation datasets, models, and python library for question generation.
☆364 Updated last year
AI21Labs / in-context-ralm
☆291 Updated last year
nlp-uoregon / mlmm-evaluation
Multilingual Large Language Models Evaluation Benchmark
☆132 Updated last year
freshllms / freshqa
Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214)
☆375 Updated last week
kaistAI / CoT-Collection
[EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
☆246 Updated last year
nelson-liu / lost-in-the-middle
Code and data for "Lost in the Middle: How Language Models Use Long Contexts"
☆359 Updated last year
asahi417 / lmppl
Calculate perplexity on a text with pre-trained language models. Supports MLM (e.g. DeBERTa), recurrent LM (e.g. GPT3), and encoder-decoder …
☆162 Updated 3 months ago
night-chen / ToolQA
ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels …
☆278 Updated 2 years ago
naver / bergen
Benchmarking library for RAG
☆230 Updated this week
voidism / DoLa
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
☆518 Updated 8 months ago
facebookresearch / contriever
Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning
☆757 Updated 2 years ago
sunnweiwei / RankGPT
Is ChatGPT Good at Search? LLMs as Re-Ranking Agent [EMNLP 2023 Outstanding Paper Award]
☆639 Updated last year
facebookresearch / CRAG
Comprehensive benchmark for RAG
☆219 Updated 4 months ago
shizhediao / active-prompt
Source code for the paper "Active Prompting with Chain-of-Thought for Large Language Models"
☆245 Updated last year
Libr-AI / do-not-answer
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
☆292 Updated last year
prometheus-eval / prometheus-eval
Evaluate your LLM's response with Prometheus and GPT4 💯
☆1,002 Updated 5 months ago
amazon-science / RefChecker
RefChecker provides automatic checking pipeline and benchmark dataset for detecting fine-grained hallucinations generated by Large Langua…
☆395 Updated 4 months ago
jzbjyb / FLARE
Forward-Looking Active REtrieval-augmented generation (FLARE)
☆654 Updated last year
xfactlab / orpo
Official repository for ORPO
☆463 Updated last year
glgh / awesome-llm-human-preference-datasets
A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.
☆380 Updated 2 years ago
EdinburghNLP / awesome-hallucination-detection
List of papers on hallucination detection in LLMs.
☆969 Updated 4 months ago
madaan / self-refine
LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
☆742 Updated last year
salesforce / DialogStudio
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI
☆515 Updated 8 months ago