[ACL 2026] A large-scale longitudinal study on robust and fair evaluation of LLMs — 200K+ generative questions across 13 disciplines
☆39May 21, 2026Updated last month
Alternatives and similar repositories for LLMEval-Fair
Users that are interested in LLMEval-Fair are comparing it to the libraries listed below.
- [AAAI 2024] LLMEval Phase II dataset — professional domain evaluation across 12 academic disciplines☆71May 21, 2026Updated last month
- The code repository of paper "TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities"☆20May 12, 2026Updated last month
- [AAAI 2024] LLMEval Phase I dataset — 17 categories, 453 questions, 2186 annotators for Chinese LLM evaluation☆114May 21, 2026Updated last month
- [EMNLP 2023 Demo] "CLEVA: Chinese Language Models EVAluation Platform"☆64May 16, 2025Updated last year
- Chinese Generation Evaluation☆13Aug 14, 2023Updated 2 years ago
- aigc evals☆10Dec 2, 2023Updated 2 years ago
- [ICLR 2025] Official code of "Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization"☆19Jun 1, 2024Updated 2 years ago
- ☆10Mar 13, 2023Updated 3 years ago
- [ACL 2024] Making Long-Context Language Models Better Multi-Hop Reasoners☆20May 28, 2024Updated 2 years ago
- Official code for the paper Improving Language Plasticity via Pretraining with Active Forgetting, NeurIPS 2023☆22Mar 12, 2026Updated 3 months ago
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)☆102Feb 20, 2025Updated last year
- ☆13Aug 12, 2022Updated 3 years ago
- Official repo for EscapeCraft (a 3D environment for room escape) and benchmark MM-Escape. This work is accepted by ICCV 2025.☆39Jul 7, 2025Updated 11 months ago
- The official implementation of the paper "Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models" (NeurIPS 2025 Pos…☆75Sep 29, 2025Updated 9 months ago
- ☆13Mar 5, 2025Updated last year
- ☆47Oct 22, 2024Updated last year
- Source Code for <Target-Side Data Augmentation for Sequence Generation>☆12Oct 6, 2021Updated 4 years ago
- ☆13Aug 3, 2024Updated last year
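- A Chinese LLM evaluation benchmark for the automotive industry, with fine-grained evaluation based on multi-turn open-ended questions☆38Dec 26, 2023Updated 2 years ago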
- [SIGIR '25] This is the code repo for our SIGIR '25 paper: Enhancing the Patent Matching Capability of Large Language Models via Memory G…☆19Apr 22, 2025Updated last year
- ☆14May 20, 2022Updated 4 years ago
- code for Scaling Laws of RoPE-based Extrapolation☆73Oct 16, 2023Updated 2 years ago
- ☆48May 3, 2026Updated 2 months ago
- Official implementation of Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information☆12Sep 28, 2023Updated 2 years ago
- ☆13Jan 21, 2024Updated 2 years ago
- [ICLR 2024] Towards Robust Multi-Modal Reasoning via Model Selection☆14Mar 7, 2024Updated 2 years ago
- IMPACT: A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents (NeurIPS 2024)☆18Jul 14, 2025Updated 11 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔☆35Apr 20, 2025Updated last year
- Code release for "MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning"☆11Oct 11, 2024Updated last year
- FlagEval is an evaluation toolkit for AI large foundation models.☆337Apr 24, 2025Updated last year
- Dataset containing Semantic Relations and Metadata, for Training and Evaluating Distributional Semantic Models in English and Mandarin Ch…☆16Aug 7, 2017Updated 8 years ago
- ☆12Jul 21, 2025Updated 11 months ago
- Official repo for "PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning"☆148Feb 4, 2026Updated 5 months ago
- JudgeLRM: Large Reasoning Models as a Judge☆42May 6, 2026Updated last month
- GLM-SIMPLE-EVALS: The evaluation repository for the GLM-4.5 series of models by Z.ai.☆40Oct 17, 2025Updated 8 months ago
- ONNX Python Examples☆16Sep 13, 2022Updated 3 years ago
- Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models☆41Sep 30, 2024Updated last year
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models☆35Oct 19, 2023Updated 2 years ago
- [AAAI 2023] This is the code for our paper "Neighborhood-Regularized Self-Training for Learning with Few Labels".☆12Jan 11, 2023Updated 3 years ago