open-compass / BotChat

Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.

☆158

Alternatives and similar repositories for BotChat

Users who are interested in BotChat are comparing it to the libraries listed below.


QwenLM / AutoIF
☆312 Updated last year
GAIR-NLP / auto-j
Generative Judge for Evaluating Alignment
☆247 Updated last year
open-compass / T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
☆294 Updated last year
OFA-Sys / InsTag
InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
☆278 Updated 2 years ago
thu-coai / CritiqueLLM
☆147 Updated last year
THUDM / LongAlign
[EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs
☆257 Updated 10 months ago
OpenBMB / UltraFeedback
A large-scale, fine-grained, diverse preference dataset (and models).
☆354 Updated last year
SqueezeAILab / LLM2LLM
[ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
☆190 Updated last year
thu-coai / BPO
☆330 Updated last year
Spico197 / Humpback
🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation.
☆138 Updated 5 months ago
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆69 Updated 5 months ago
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆572 Updated 10 months ago
thu-coai / ComplexBench
Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)
☆95 Updated 8 months ago
OFA-Sys / Ditto
A self-alignment method for role-play. Benchmark for role-play. Resources for "Large Language Models are Superpositions of All Characters…
☆204 Updated last year
anchen1011 / FireAct
FireAct: Toward Language Agent Fine-tuning
☆283 Updated 2 years ago
icip-cas / ChatAlpaca
A Multi-Turn Dialogue Corpus based on Alpaca Instructions
☆175 Updated 2 years ago
DAMO-NLP-SG / M3Exam
Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models"
☆102 Updated 2 years ago
nick7nlp / Counting-Stars
Counting-Stars (★)
☆83 Updated 4 months ago
X-PLUG / Multi-LLM-Agent
☆233 Updated last year
tianyi-lab / Cherry_LLM
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆398 Updated 4 months ago
zjunlp / AutoAct
[ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning
☆229 Updated 9 months ago
zexuanqiu / CLongEval
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models
☆45 Updated last year
LaVi-Lab / CLEVA
[EMNLP 2023 Demo] "CLEVA: Chinese Language Models EVAluation Platform"
☆62 Updated 5 months ago
i-Eval / FairEval
☆141 Updated 2 years ago
GAIR-NLP / OPO
☆51 Updated last year
bigai-nlco / LooGLE
ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models
☆186 Updated last year
night-chen / ToolQA
ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels …
☆280 Updated 2 years ago
RUC-GSAI / YuLan-IR
YuLan-IR: Information Retrieval Boosted LMs
☆221 Updated last year
raunak-agarwal / instruction-datasets
Datasets for Instruction Tuning of Large Language Models
☆257 Updated last year
chuanyang-Zheng / Progressive-Hint
This is the official implementation of "Progressive-Hint Prompting Improves Reasoning in Large Language Models"
☆209 Updated 2 years ago