ruixiangcui/AGIEval

☆774

Alternatives and similar repositories for AGIEval

Users who are interested in AGIEval are comparing it to the repositories listed below.

OpenLMLab / GAOKAO-Bench
GAOKAO-Bench is an evaluation framework that utilizes GAOKAO questions as a dataset to evaluate large language models.
☆784 · Jan 7, 2025 · Updated last year
hkust-nlp / ceval
Official GitHub repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
☆1,862 · Jul 27, 2025 · Updated last year
hendrycks / test
Measuring Massive Multitask Language Understanding | ICLR 2021
☆1,603 · May 28, 2023 · Updated 3 years ago
ExpressAI / AI-Gaokao
Gaokao Benchmark for AI
☆109 · Jul 8, 2022 · Updated 4 years ago
FranxYao / chain-of-thought-hub
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
☆2,776 · Aug 4, 2024 · Updated last year
tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆2,007 · Aug 9, 2025 · Updated 11 months ago
haonan-li / CMMLU
CMMLU: Measuring massive multitask language understanding in Chinese
☆829 · Dec 6, 2024 · Updated last year
google / BIG-bench
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
☆3,250 · Jul 19, 2024 · Updated 2 years ago
suzgunmirac / BIG-Bench-Hard
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
☆566 · Jun 25, 2024 · Updated 2 years ago
openai / prm800k
800,000 step-level correctness labels on LLM solutions to MATH problems
☆2,152 · Jun 1, 2023 · Updated 3 years ago
Instruction-Tuning-with-GPT-4 / GPT-4-LLM
Instruction Tuning with GPT-4
☆4,333 · Jun 11, 2023 · Updated 3 years ago
declare-lab / instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
☆552 · Mar 10, 2024 · Updated 2 years ago
WeOpenML / PandaLM
☆926 · May 22, 2024 · Updated 2 years ago
MLGroupJLU / LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
☆1,610 · Apr 17, 2026 · Updated 3 months ago
CarperAI / trlx
A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
☆4,752 · Jan 8, 2024 · Updated 2 years ago
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆13,443 · Jul 13, 2026 · Updated 2 weeks ago
GanjinZero / RRHF
[NIPS2023] RRHF & Wombat
☆805 · Sep 22, 2023 · Updated 2 years ago
openai / human-eval
Code for the paper "Evaluating Large Language Models Trained on Code"
☆3,324 · Jan 17, 2025 · Updated last year
baichuan-inc / Baichuan-7B
A large-scale 7B pretraining language model developed by BaiChuan-Inc.
☆5,651 · Jul 18, 2024 · Updated 2 years ago
LianjiaTech / BELLE
BELLE: Be Everyone's Large Language model Engine (open-source Chinese conversational large language model)
☆8,279 · Oct 16, 2024 · Updated last year
THUDM / AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
☆3,611 · Feb 8, 2026 · Updated 5 months ago
open-compass / opencompass
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, …
☆7,241 · Updated this week
allenai / natural-instructions
Expanding natural instructions
☆1,045 · Dec 11, 2023 · Updated 2 years ago
FreedomIntelligence / LLMZoo
⚡LLM Zoo is a project that provides data, models, and evaluation benchmark for large language models.⚡
☆2,939 · Nov 26, 2023 · Updated 2 years ago
CLUEbenchmark / SuperCLUE
SuperCLUE: A comprehensive benchmark for Chinese general-purpose large models | A Benchmark for Foundation Models in Chinese
☆3,296 · Feb 6, 2026 · Updated 5 months ago
thunlp / UltraChat
Large-scale, Informative, and Diverse Multi-round Chat Data (and Models)
☆2,877 · Mar 13, 2024 · Updated 2 years ago
OFA-Sys / gsm8k-ScRel
Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
☆270 · Sep 12, 2024 · Updated last year
yizhongw / self-instruct
Aligning pretrained language models with instruction data generated by themselves.
☆4,607 · Mar 27, 2023 · Updated 3 years ago
MikeGu721 / XiezhiBenchmark
☆98 · Dec 5, 2023 · Updated 2 years ago
Timothyxxx / Chain-of-ThoughtsPapers
A trend starts from "Chain of Thought Prompting Elicits Reasoning in Large Language Models".
☆2,105 · Oct 5, 2023 · Updated 2 years ago
stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models …
☆2,865 · Jul 1, 2026 · Updated 3 weeks ago
google-research / FLAN
☆1,566 · Jul 2, 2026 · Updated 3 weeks ago
deepspeedai / DeepSpeedExamples
Example models using DeepSpeed
☆6,831 · Updated this week
AetherCortex / Llama-X
Open Academic Research on Improving LLaMA to SOTA LLM
☆1,605 · Aug 30, 2023 · Updated 2 years ago
anthropics / hh-rlhf
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
☆1,853 · Jun 17, 2025 · Updated last year
bigscience-workshop / xmtf
Crosslingual Generalization through Multitask Finetuning
☆535 · Sep 22, 2024 · Updated last year
PhoebusSi / Alpaca-CoT
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., lora, p-tunin…
☆2,791 · Dec 12, 2023 · Updated 2 years ago
huggingface / trl
Train transformer language models with reinforcement learning.
☆18,953 · Updated this week
openai / grade-school-math
☆1,453 · Jan 21, 2024 · Updated 2 years ago