☆21 · Updated Aug 19, 2024
Alternatives and similar repositories for HalluDial
Users interested in HalluDial are comparing it to the libraries listed below.
- [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO ☆62 · Updated Apr 30, 2025
- ☆49 · Updated Jan 7, 2024
- ☆16 · Updated Sep 27, 2023
- Flames is a highly adversarial Chinese benchmark for evaluating LLM harmlessness, developed by Shanghai AI Lab and the Fudan NLP Group. ☆63 · Updated May 21, 2024
- ☆17 · Updated Dec 21, 2023
- LLM evaluation. ☆16 · Updated Nov 7, 2023
- [EMNLP 2024] A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models. ☆20 · Updated Sep 23, 2024
- An evaluation suite for Retrieval-Augmented Generation (RAG). ☆23 · Updated Apr 26, 2025
- ☆22 · Updated Feb 3, 2024
- Code and data for the FACTOR paper ☆53 · Updated Nov 15, 2023
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors (ACL 2023) ☆28 · Updated Mar 26, 2024
- [NAACL 2025 Main, Selected Oral] Repository for the paper "Prompt Compression for Large Language Models: A Survey" ☆36 · Updated May 18, 2025
- Problem A of the 2nd "Teddy Cup" Data Analysis Vocational Skills Competition ☆10 · Updated Sep 15, 2020
- Code for our CIKM'21 paper "Complex Temporal Question Answering on Knowledge Graphs" ☆31 · Updated Jan 13, 2024
- [ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc. ☆180 · Updated Jun 7, 2025
- Dataset and evaluation script for "Evaluating Hallucinations in Chinese Large Language Models" ☆136 · Updated Jun 5, 2024
- A Chinese LLM evaluation benchmark for the automotive industry, with fine-grained evaluation based on multi-turn open-ended questions ☆38 · Updated Dec 26, 2023
- BeHonest: Benchmarking Honesty in Large Language Models ☆34 · Updated Aug 15, 2024
- The source code of [WWW 2025] MoDiCF ☆12 · Updated Jul 12, 2025
- A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Datasets and Benchmarks Track) ☆174 · Updated Jun 27, 2025
- [IJCAI 2024] FactCHD: Benchmarking Fact-Conflicting Hallucination Detection ☆90 · Updated Apr 28, 2024
- DOMAINEVAL is an auto-constructed benchmark for multi-domain code generation that consists of 2k+ subjects (i.e., description, reference …) ☆14 · Updated Dec 12, 2024
- TOD-Flow: Modeling the Structure of Task-Oriented Dialogues ☆13 · Updated Feb 7, 2024
- Notes and takeaways from the 8th "Teddy Cup" Data Mining Challenge ☆10 · Updated Nov 26, 2020
- CDbw Index For Cluster Validation ☆10 · Updated Mar 26, 2019
- [CVPR 2024] Learning from Synthetic Human Group Activities ☆14 · Updated Feb 24, 2025
- Winning solution of the Microsoft Research "First TextWorld Problems: A Reinforcement and Language Learning Challenge" ☆12 · Updated Jun 21, 2022
- Evaluation Pipeline for medical tasks. ☆12 · Updated Feb 13, 2026
- ☆12 · Updated Jan 11, 2026
- A Swedish Natural Language Understanding Benchmark ☆11 · Updated Dec 12, 2025
- A framework for few-shot evaluation of autoregressive language models. ☆12 · Updated Jul 14, 2025
- MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols ☆16 · Updated Nov 19, 2025
- Classification of human emotion using multi-modal models ☆12 · Updated Jun 27, 2020
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆107 · Updated Mar 6, 2025
- ☆43 · Updated Sep 3, 2024
- A Python tool to help interact with ChatGPT. ☆10 · Updated Dec 11, 2022
- ☆12 · Updated Mar 5, 2025
- ☆10 · Updated Apr 11, 2022
- Code and Data for GlitchBench ☆13 · Updated Feb 27, 2024