Code and data for "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models"
☆53 · Nov 18, 2025 · Updated 5 months ago
Alternatives and similar repositories for MT-Eval
Users who are interested in MT-Eval are comparing it to the repositories listed below.
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ☆148 · Jul 24, 2024 · Updated last year
- Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances. ☆162 · May 22, 2025 · Updated 11 months ago
- Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*… ☆14 · Aug 8, 2025 · Updated 9 months ago
- Code for M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models ☆23 · Jul 27, 2024 · Updated last year
- Repo for the EMNLP 2021 paper: Lifelong Event Detection with Knowledge Transfer ☆14 · Sep 2, 2021 · Updated 4 years ago
- ☆59 · Aug 22, 2024 · Updated last year
- Fork of Bliss ☆15 · Dec 13, 2025 · Updated 4 months ago
- Short RL ☆18 · Apr 16, 2026 · Updated 3 weeks ago
- [EMNLP 2022] Code and data for "Controllable Dialogue Simulation with In-Context Learning" ☆34 · Feb 22, 2023 · Updated 3 years ago
- ☆10 · Dec 19, 2023 · Updated 2 years ago
- ☆18 · Feb 29, 2024 · Updated 2 years ago
- An (incomplete) overview of information extraction ☆43 · Apr 28, 2022 · Updated 4 years ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆139 · Jun 4, 2024 · Updated last year
- Detect-Then-Explain Framework for Text-to-SQL task ☆10 · Dec 6, 2023 · Updated 2 years ago
- ☆12 · Mar 16, 2025 · Updated last year
- Using self-play to augment multi-turn text-to-SQL datasets ☆11 · Oct 20, 2022 · Updated 3 years ago
- ☆40 · Apr 13, 2026 · Updated 3 weeks ago
- The implementation for "Measuring Fine-Grained Domain Relevance of Terms: A Hierarchical Core-Fringe Approach" (ACL '21) ☆16 · Jun 13, 2021 · Updated 4 years ago
- ☆16 · May 31, 2024 · Updated last year
- ☆10 · Jul 13, 2024 · Updated last year
- [ICME 2019] Source code and datasets for "Semi-supervised Compatibility Learning Across Categories for Clothing Matching" ☆11 · Apr 26, 2024 · Updated 2 years ago
- A trainable user simulator ☆34 · Jun 30, 2025 · Updated 10 months ago
- Code and data for CoachLM, an automatic instruction revision approach for LLM instruction tuning. ☆60 · Mar 20, 2024 · Updated 2 years ago
- Probing and Generalization of Metaphorical Knowledge in Pre-Trained Language Models [ACL 2022] ☆23 · May 15, 2022 · Updated 3 years ago
- ☆11 · Aug 13, 2023 · Updated 2 years ago
- Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation ☆20 · Jun 11, 2025 · Updated 10 months ago
- VQA-Med 2021 ☆22 · Jul 11, 2022 · Updated 3 years ago
- ☆24 · Feb 16, 2025 · Updated last year
- Code and Dataset for the CVPRW Paper "Where did I leave my keys? Episodic-Memory-Based Question Answering on Egocentric Videos" ☆29 · Aug 28, 2023 · Updated 2 years ago
- Parses, Analyzes and Predicts for the Korean Baseball League ☆17 · Dec 8, 2022 · Updated 3 years ago
- Implementation of "Decoding-time Realignment of Language Models", ICML 2024. ☆21 · Jun 17, 2024 · Updated last year
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track) ☆103 · Feb 20, 2025 · Updated last year
- ☆144 · Sep 10, 2023 · Updated 2 years ago
- A Bilingual Role Evaluation Benchmark for Large Language Models ☆43 · Jan 9, 2024 · Updated 2 years ago
- Collection of papers for scalable automated alignment. ☆93 · Oct 22, 2024 · Updated last year
- ☆14 · Jul 25, 2024 · Updated last year
- DocChecker: Bootstrapping Code-Text Pretrained Language Model to Detect Inconsistency Between Code and Comment ☆16 · Jan 23, 2024 · Updated 2 years ago
- ☆20 · Nov 3, 2024 · Updated last year
- ☆25 · Apr 3, 2025 · Updated last year