JiayuJeff/CostBench

The official code repository for the paper "CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents"

☆34

Alternatives and similar repositories for CostBench

Users who are interested in CostBench are comparing it to the libraries listed below.

HKUST-KnowComp / NAACL
The official codebase for our paper "NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems"
☆24 · Feb 28, 2026 · Updated 5 months ago
JiayuJeff / PlanBench-XL
Official Repository for our paper: PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems
☆38 · Jul 16, 2026 · Updated last week
vlm2-bench / VLM2-Bench
VLM2-Bench [ACL 2025 Main]: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
☆45 · May 20, 2025 · Updated last year
HKUST-KnowComp / CSKB-Population
Codes for the EMNLP2021 paper: Benchmarking Commonsense Knowledge Base Population (https://aclanthology.org/2021.emnlp-main.705.pdf). An …
☆26 · Feb 14, 2024 · Updated 2 years ago
RickySkywalker / LeanOfThought-Official
This is the official implementation for MA-LoT.
☆20 · Aug 4, 2025 · Updated 11 months ago
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 · Jul 15, 2024 · Updated 2 years ago
lukahhcm / Awesome_Environment_Scaling
Resources and paper list for 'Scaling Environments for Agents'. This repository accompanies our survey on how environments contribute to …
☆72 · Jan 28, 2026 · Updated 6 months ago
G-JWLee / TAMP
☆12 · May 15, 2025 · Updated last year
YangHaolin0526 / MARS-SQL
☆43 · Dec 19, 2025 · Updated 7 months ago
hkust-nlp / AgentVista
Benchmarking multimodal agents on realistic, ultra-challenging visual scenarios requiring long-horizon hybrid tool use.
☆68 · Mar 10, 2026 · Updated 4 months ago
adobe-research / llava-score
☆11 · Oct 2, 2024 · Updated last year
VITA-Group / Trap-and-Replace-Backdoor-Defense
[NeurIPS'22] Trap and Replace: Defending Backdoor Attacks by Trapping Them into an Easy-to-Replace Subnetwork. Haotao Wang, Junyuan Hong,…
☆15 · Nov 27, 2023 · Updated 2 years ago
AI45Lab / DeepScan
Diagnostic Framework for LLMs and MLLMs
☆39 · Mar 2, 2026 · Updated 4 months ago
FoundationAgents / AutoEnv
Scaling Agentic Environments Automatically.
☆66 · Mar 26, 2026 · Updated 4 months ago
microsoft / Simia-Agent-Training
Official Implementation of "Simulating Environments with Reasoning Models for Agent Training"
☆65 · Feb 18, 2026 · Updated 5 months ago
qiancheng0 / EscapeBench
This is the repository for the paper "EscapeBench: Pushing Language Models to Think Outside the Box"
☆18 · Dec 19, 2024 · Updated last year
HKUST-KnowComp / IntentionQA
Code and data for the paper: IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Large Language Models …
☆12 · Apr 27, 2024 · Updated 2 years ago
Zhaoyi-Li21 / creme
[ACL 2024] "Understanding and Patching Compositional Reasoning in LLMs"
☆14 · Aug 28, 2024 · Updated last year
AI45Lab / DeepSafe
All-in-One Safety Evaluation Framework
☆51 · Jul 15, 2026 · Updated 2 weeks ago
liujch1998 / vera
☆17 · May 23, 2023 · Updated 3 years ago
X1AOX1A / Word2World
[ACL 2026 Oral] From Word to World: Can Large Language Models be Implicit Text-based World Models?
☆66 · Apr 13, 2026 · Updated 3 months ago
X1AOX1A / ZoFiles
Connect Claude to your Zotero library: a Zotero plugin that mirrors collections as agent-readable folders with Markdown, BibTeX, and AI re…
☆17 · May 21, 2026 · Updated 2 months ago
TaiMingLu / know-dont-tell
☆19 · Oct 14, 2024 · Updated last year
StarDewXXX / UltraHorizon
Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
☆27 · Sep 30, 2025 · Updated 9 months ago
ScarletPan / probase-concept
A fast and neat API for Conceptualization of Probase
☆17 · Oct 28, 2019 · Updated 6 years ago
Raibows / CREAM
Code for "CREAM: Consistency Regularized Self-Rewarding Language Models", ICLR 2025.
☆29 · Feb 17, 2025 · Updated last year
wzf2000 / THUCS
Some material for THUCS courses.
☆53 · Jul 4, 2022 · Updated 4 years ago
IVY-LVLM / CODE
Official Implementation of CODE
☆17 · Sep 26, 2024 · Updated last year
QingyuLiu / Agentic-Upward-Deception
This repo is the official implementation of “Are Your Agents Upward Deceivers?”. The paper was accepted by ICML 2026.
☆24 · Dec 15, 2025 · Updated 7 months ago
zx1239856 / UndergradProjects
Collections of Undergraduate Course Projects
☆22 · Jul 17, 2026 · Updated last week
Gen-Verse / GenEnv
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
☆62 · Dec 23, 2025 · Updated 7 months ago
sustech-nlp / SPPO
[ACL 2026 Oral] Official repository for SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks.
☆26 · May 18, 2026 · Updated 2 months ago
ShujinWu-0814 / ALOE
Public code repo for COLING 2025 paper "Aligning LLMs with Individual Preferences via Interaction"
☆41 · Apr 3, 2025 · Updated last year
BKHMSI / cultural-trends
Investigating Cultural Alignment of Large Language Models
☆13 · Aug 14, 2024 · Updated last year
hkust-nlp / LOCA-bench
Benchmarking Language Agents Under Controllable and Extreme Context Growth
☆51 · Apr 29, 2026 · Updated 3 months ago
xiaomi-research / guievalkit
[ICML 2026] GUIEvalKit: Open-source Evaluation Toolkit for GUI Agents
☆24 · Feb 26, 2026 · Updated 5 months ago
google / haloquest
☆25 · Aug 2, 2024 · Updated last year
HarlynDN / WebCiteS
[ACL'24] WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations
☆13 · Sep 11, 2024 · Updated last year
lifan-yuan / PLMCalibration
Code for ACL 2023 paper "A Close Look into the Calibration of Pre-trained Language Models"
☆11 · May 9, 2023 · Updated 3 years ago