zhangxjohn / LLM-Agent-Benchmark-List
A benchmark list for evaluating large language models.
★130 Updated 2 weeks ago
Alternatives and similar repositories for LLM-Agent-Benchmark-List
Users who are interested in LLM-Agent-Benchmark-List are comparing it to the libraries listed below
- augmented LLM with self reflection ★129 Updated last year
- Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ★221 Updated 2 months ago
- ★234 Updated 11 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ★329 Updated last year
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ★223 Updated 2 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ★141 Updated 6 months ago
- Official Implementation of Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization ★154 Updated last year
- This repository contains an LLM benchmark for the social deduction game "Resistance Avalon" ★118 Updated last month
- Code for the paper Tree Search for Language Model Agents ★205 Updated 11 months ago
- [ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples ★100 Updated last month
- [ICLR 2025] Benchmarking Agentic Workflow Generation ★106 Updated 4 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ★146 Updated 8 months ago
- ★199 Updated 3 months ago
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ★138 Updated 7 months ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ★188 Updated 11 months ago
- A curated collection of LLM reasoning and planning resources, including key papers, limitations, benchmarks, and additional learning mate… ★281 Updated 4 months ago
- ★114 Updated 5 months ago
- [ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning ★229 Updated 6 months ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ★126 Updated last year
- FireAct: Toward Language Agent Fine-tuning ★279 Updated last year
- ★238 Updated last month
- ★95 Updated 7 months ago
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans. ★88 Updated 3 months ago
- AWM: Agent Workflow Memory ★291 Updated 5 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ★112 Updated last week
- "Improving Mathematical Reasoning with Process Supervision" by OpenAI ★110 Updated 3 weeks ago
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ★206 Updated 2 years ago
- A Comprehensive Benchmark for Software Development. ★111 Updated last year
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024) ★137 Updated last month
- ★297 Updated last year