llmeval / llmeval-3

Chinese Large Language Model Evaluation, Phase 3

☆26

Alternatives and similar repositories for llmeval-3

Users who are interested in llmeval-3 compare it to the repositories listed below.

zexuanqiu / CLongEval
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models
☆40 Updated last year
THUDM / ChatGLM-Math
☆82 Updated last year
thu-coai / CritiqueLLM
☆144 Updated last year
nick7nlp / Counting-Stars
Counting-Stars (★)
☆83 Updated last month
GAIR-NLP / OPO
☆50 Updated last year
MikeGu721 / XiezhiBenchmark
☆96 Updated last year
FlagOpen / Infinity-Instruct
☆48 Updated last year
OpenMOSS / Say-I-Dont-Know
[ICML'2024] Can AI Assistants Know What They Don't Know?
☆81 Updated last year
LaVi-Lab / CLEVA
[EMNLP 2023 Demo] "CLEVA: Chinese Language Models EVAluation Platform" [ACL 2025 Findings] "C2LEVA: Toward Comprehensive and Contaminatio…
☆63 Updated 2 months ago
THUDM / LongReward
☆56 Updated 8 months ago
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆68 Updated 2 months ago
pldlgb / nuggets
☆84 Updated last year
OpenMOSS / HalluQA
Dataset and evaluation script for "Evaluating Hallucinations in Chinese Large Language Models"
☆130 Updated last year
KwaiKEG / CogGPT
Unleashing the Power of Cognitive Dynamics on Large Language Models
☆62 Updated 9 months ago
KbsdJames / MATH-Minos
The implementation of paper "LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Fee…
☆38 Updated 11 months ago
lfy79001 / TableQAKit
A Toolkit for Table-based Question Answering
☆112 Updated last year
sail-sg / sdft
[ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning".
☆124 Updated 8 months ago
CASIA-LM / MoDS
☆142 Updated last year
OpenBMB / UltraEval
[ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.
☆244 Updated 8 months ago
Felixgithub2017 / CG-Eval
Chinese Generation Evaluation
☆12 Updated last year
OFA-Sys / DiverseEvol
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
☆81 Updated last year
cavalierlulu / rag_survey
☆124 Updated last year
RUC-GSAI / Llama-3-SynE
Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continual pre-training to improve …
☆33 Updated last month
llmeval / llmeval-2
Chinese Large Language Model Evaluation, Phase 2
☆70 Updated last year
thu-coai / AutoDetect
Official GitHub repo for AutoDetect, an automated weakness detection framework for LLMs.
☆42 Updated last year
RUCKBReasoning / CoT-based-Synthesizer
Official code implementation for the ACL 2025 paper: 'CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis'
☆27 Updated last month
GAIR-NLP / ReAlign
Reformatted Alignment
☆113 Updated 9 months ago
meowpass / FollowComplexInstruction
Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L…
☆50 Updated last year
Spico197 / Humpback
🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation.
☆140 Updated 2 months ago
tjunlp-lab / M3KE
A Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark
☆102 Updated last year