PKU-ONELab / Themis
The official repository for our NLG evaluation LLM Themis and the paper Themis: Towards Flexible and Interpretable NLG Evaluation.
★16 Updated 4 months ago
Related projects
Alternatives and complementary repositories for Themis
- 🩺 A collection of ChatGPT evaluation reports on various benchmarks. ★48 Updated last year
- GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems. ★46 Updated 4 months ago
- First explanation metric (diagnostic report) for text generation evaluation ★60 Updated 4 months ago
- Technical Report: Is ChatGPT a Good NLG Evaluator? A Preliminary Study ★42 Updated last year
- Towards Systematic Measurement for Long Text Quality ★28 Updated 2 months ago
- ★47 Updated 2 months ago
- This project maintains a reading list for general text generation tasks ★65 Updated 2 years ago
- Repo for "On Learning to Summarize with Large Language Models as References" ★42 Updated last year
- ★16 Updated 8 months ago
- GPT as Human ★18 Updated 9 months ago
- An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs). ★42 Updated 3 months ago
- [EMNLP 2023] ALCUNA: Large Language Models Meet New Knowledge ★25 Updated last year
- ★35 Updated last year
- The code implementation of the EMNLP2022 paper: DisCup: Discriminator Cooperative Unlikelihood Prompt-tuning for Controllable Text Gene… ★25 Updated last year
- Code and dataset for the EMNLP paper titled Instruct and Extract: Instruction Tuning for On-Demand Information Extraction ★49 Updated 10 months ago
- [ICLR'24 spotlight] Tool-Augmented Reward Modeling ★36 Updated 8 months ago
- [ICML'2024] Can AI Assistants Know What They Don't Know? ★70 Updated 9 months ago
- Code for M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models ★22 Updated 3 months ago
- Code base of In-Context Learning for Dialogue State tracking ★44 Updated last year
- Collection of papers for scalable automated alignment. ★72 Updated 3 weeks ago
- ★23 Updated last year
- Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication ★16 Updated 7 months ago
- Code and data for "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models" ★30 Updated 3 weeks ago
- Detect hallucinated tokens for conditional sequence generation. ★63 Updated 2 years ago
- The LM Contamination Index is a manually created database of contamination evidences for LMs. ★75 Updated 7 months ago
- Implementation of "Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation" ★77 Updated last year
- PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialog… ★27 Updated 3 years ago
- On Transferability of Prompt Tuning for Natural Language Processing ★97 Updated 6 months ago
- Code and data for paper "Context-faithful Prompting for Large Language Models". ★39 Updated last year
- Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks? ★57 Updated last year