Evaluating LLMs' multi-round chatting capability by assessing conversations generated by two LLM instances.
☆161May 22, 2025Updated 9 months ago
Alternatives and similar repositories for BotChat
Users who are interested in BotChat are comparing it to the libraries listed below
- ☆11Nov 5, 2024Updated last year
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset☆111May 22, 2025Updated 9 months ago
- Code and data for "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models"☆51Nov 18, 2025Updated 3 months ago
- A Dataset for Multi-Turn Dialogue Reasoning☆332Oct 7, 2020Updated 5 years ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha…☆133Jun 4, 2024Updated last year
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues☆141Jul 24, 2024Updated last year
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"☆56May 22, 2025Updated 9 months ago
- [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs☆49Nov 29, 2024Updated last year
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"☆17Feb 22, 2024Updated 2 years ago
- ☆16Oct 21, 2024Updated last year
- Code for ACL 2021 paper "Unsupervised Out-of-Domain Detection via Pre-trained Transformers"☆30Aug 20, 2021Updated 4 years ago
- Generative Judge for Evaluating Alignment☆250Jan 18, 2024Updated 2 years ago
- RewardBench: the first evaluation tool for reward models.☆696Feb 16, 2026Updated last week
- Lawma: A lightly fine-tuned Llama model for legal classification tasks.☆28Sep 14, 2024Updated last year
- ☆10Nov 7, 2022Updated 3 years ago
- ☆58Nov 1, 2021Updated 4 years ago
- Multi-dimensional Chinese alignment evaluation benchmark for large language models (ACL 2024)☆420Oct 25, 2025Updated 4 months ago
- ☆26Nov 21, 2022Updated 3 years ago
- [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs☆29May 22, 2025Updated 9 months ago
- Official repository of MMDU dataset☆104Sep 29, 2024Updated last year
- Towards Quantifiable Dialogue Coherence Evaluation (ACL 2021)☆59Oct 26, 2021Updated 4 years ago
- python project template for personal projects! 🙋‍♀️☆11Nov 28, 2020Updated 5 years ago
- [Findings of ACL-2023] This is the official implementation of On the Difference of BERT-style and CLIP-style Text Encoders.☆14Jun 7, 2023Updated 2 years ago
- Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"☆14Jul 19, 2024Updated last year
- Code and Data for EMNLP 2023 Paper "MenatQA: A New Dataset for Testing the Temporal Comprehension and Reasoning Abilities of Large Langu…☆14Apr 7, 2025Updated 10 months ago
- Official repository for "DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation (ACL2023 Findings)"☆11May 23, 2023Updated 2 years ago
- OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, …☆6,688Updated this week
- An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.☆1,953Aug 9, 2025Updated 6 months ago
- ☆160Nov 23, 2024Updated last year
- Chinese Generation Evaluation☆13Aug 14, 2023Updated 2 years ago
- [ICLR24] The open-source repo of THU-KEG's KoLA benchmark.☆52Sep 28, 2023Updated 2 years ago
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va…☆12Nov 6, 2023Updated 2 years ago
- Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code☆10Aug 29, 2023Updated 2 years ago
- [NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors.☆36Oct 14, 2025Updated 4 months ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs☆63Mar 26, 2024Updated last year
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"☆803Jul 16, 2025Updated 7 months ago
- Transforms OWL ontologies into FHIR code systems.☆22Apr 3, 2023Updated 2 years ago
- This repository contains the files used for our Interspeech 2017 paper.☆16May 30, 2017Updated 8 years ago
- Investigating Cultural Alignment of Large Language Models☆13Aug 14, 2024Updated last year