zhangxjohn / LLM-Agent-Benchmark-ListLinks

A banchmark list for evaluation of large language models.

☆134

Alternatives and similar repositories for LLM-Agent-Benchmark-List

Users that are interested in LLM-Agent-Benchmark-List are comparing it to the libraries listed below

Sorting:

rxlqn / awesome-llm-self-reflection
augmented LLM with self reflection
☆129Updated last year
zorazrw / awesome-tool-llm
☆237Updated 11 months ago
hkust-nlp / AgentBoard
An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]
☆335Updated last year
StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…
☆232Updated 2 months ago
SALT-NLP / DyLAN
Official Implementation of Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization
☆158Updated last year
facebookresearch / sweet_rl
Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents onCollaborative Reasoning Tasks
☆233Updated 3 months ago
zjunlp / WKM
[NeurIPS 2024] Agent Planning with World Knowledge Model
☆144Updated 7 months ago
kanishkg / cognitive-behaviors
☆203Updated 4 months ago
Yu-Fangxu / FoR
[ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples
☆103Updated last week
jwhj / OREO
☆114Updated 6 months ago
zjunlp / WorfBench
[ICLR 2025] Benchmarking Agentic Workflow Generation
☆117Updated 5 months ago
jonathanmli / Avalon-LLM
This repository contains a LLM benchmark for the social deduction game `Resistance Avalon'
☆120Updated 2 months ago
Yifan-Song793 / ETO
Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference)
☆147Updated 9 months ago
CMU-AIRe / MRT
Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning".
☆100Updated 3 weeks ago
tongyx361 / Awesome-LLM4Math
Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit…
☆133Updated last year
zankner / CLoud
Critique-out-Loud Reward Models
☆70Updated 9 months ago
GAIR-NLP / ToRL
☆258Updated 2 months ago
Berkeley-NLP / Agent-Eval-Refine
Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024]
☆139Updated 8 months ago
icip-cas / Verifier-Engineering
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
☆61Updated 8 months ago
YifeiZhou02 / ArCHer
Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL"
☆185Updated 3 months ago
Open-Source-O1 / o1_Reasoning_Patterns_Study
☆103Updated 8 months ago
openai / safety-rbr-code-and-data
Code and example data for the paper: Rule Based Rewards for Language Model Safety
☆190Updated last year
NumberChiffre / mcts-llm
☆95Updated 7 months ago
ReTool-RL / ReTool
☆174Updated 3 months ago
xingyaoww / mint-bench
Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha…
☆128Updated last year
QwenLM / ProcessBench
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆166Updated 2 months ago
kohjingyu / search-agents
Code for the paper 🌳 Tree Search for Language Model Agents
☆208Updated last year
TIGER-AI-Lab / verl-tool
A version of verl to support tool use
☆312Updated last week
WooooDyy / LLM-Reverse-Curriculum-RL
Implementation of the ICML 2024 paper "Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning" pr…
☆107Updated last year
InfiAgent / InfiAgent
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024)
☆140Updated 2 months ago