LLM evaluation.
☆16 · Updated Nov 7, 2023
Alternatives and similar repositories for llm_eval
Users interested in llm_eval are comparing it to the libraries listed below.
- An end-to-end benchmark suite of multi-modal DNN applications for system-architecture co-design ☆22 · Updated Dec 13, 2024
- ☆21 · Updated Aug 19, 2024
- Code for paper "Point and Ask: Incorporating Pointing into Visual Question Answering" ☆19 · Updated Oct 4, 2022
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆63 · Updated Mar 26, 2024
- ☆27 · Updated Jul 20, 2024
- [NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment ☆58 · Updated Sep 26, 2024
- Data and code for paper "M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models" ☆103 · Updated Jun 15, 2023
- [ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models. ☆256 · Updated Oct 30, 2024
- A Chinese LLM evaluation benchmark for the automotive industry, with fine-grained evaluation based on multi-turn open-ended questions ☆38 · Updated Dec 26, 2023
- Repo for paper "Paxion: Patching Action Knowledge in Video-Language Foundation Models" (NeurIPS'23 Spotlight) ☆37 · Updated May 23, 2023
- The Oyster series is a set of safety models developed in-house by Alibaba-AAIG, devoted to building a responsible AI ecosystem. | Oyster … ☆59 · Updated Sep 11, 2025
- GAOKAO-Bench-Updates is a supplement to GAOKAO-Bench, a dataset to evaluate large language models. ☆39 · Updated Jan 7, 2025
- SC-Safety: a multi-turn adversarial safety benchmark for Chinese LLMs ☆150 · Updated Mar 15, 2024
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models ☆45 · Updated Jun 14, 2024
- Evaluating LLMs with CommonGen-Lite ☆95 · Updated Mar 21, 2024
- [ACL 2024] SALAD benchmark & MD-Judge ☆171 · Updated Mar 8, 2025
- ☆12 · Updated Jan 11, 2026
- A framework for few-shot evaluation of autoregressive language models. ☆12 · Updated Jul 14, 2025
- A Swedish Natural Language Understanding Benchmark ☆11 · Updated Dec 12, 2025
- DOMAINEVAL is an auto-constructed benchmark for multi-domain code generation that consists of 2k+ subjects (i.e., description, reference … ☆14 · Updated Dec 12, 2024
- [CVPR 2024] Learning from Synthetic Human Group Activities ☆14 · Updated Feb 24, 2025
- S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models ☆109 · Updated Feb 13, 2026
- Official GitHub repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety. [ACL 2024] ☆273 · Updated Jul 28, 2025
- Synthesize bio-plausible neural networks for cognitive tasks, mimicking brain architecture ☆11 · Updated Apr 14, 2021
- Code and Data for GlitchBench ☆13 · Updated Feb 27, 2024
- Shaping Language Models with Cognitive Insights ☆15 · Updated Feb 29, 2024
- Dataset for AAAI paper "Natural Language Inference in Context - Investigating Contextual Reasoning over Long Texts" ☆11 · Updated Nov 18, 2022
- Rationale-enhanced language models are better continual relation learners (EMNLP 2023 Main Conference) ☆12 · Updated Oct 11, 2023
- ☆11 · Updated Mar 13, 2023
- [ACL 2025 Main] (🏆 Outstanding Paper Award) Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Proba… ☆15 · Updated Aug 15, 2025
- Scalable Meta-Evaluation of LLMs as Evaluators ☆43 · Updated Feb 15, 2024
- LLM red teaming datasets from the paper 'Student-Teacher Prompting for Red Teaming to Improve Guardrails' for the ART of Safety Workshop … ☆22 · Updated Oct 12, 2023
- ☆11 · Updated Jan 3, 2024
- LLM benchmarks ☆13 · Updated Feb 22, 2024
- AIGC report series, 2022-2023 ☆11 · Updated Feb 25, 2024
- Website for the release of the TellMeWhy dataset for why-question answering ☆14 · Updated Nov 11, 2022
- Code for our project CROWN (Conversational Passage Ranking by Reasoning over Word Networks) ☆10 · Updated Jan 11, 2024
- ☆12 · Updated Mar 5, 2025
- Benchmarks for evaluating MT models ☆11 · Updated Jun 26, 2024