ToolBeHonest/ToolBeHonest

[EMNLP 2024] A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models.

☆20

Alternatives and similar repositories for ToolBeHonest

Users who are interested in ToolBeHonest are comparing it to the repositories listed below.

TianHongZXY / qaap
[EMNLP 2023] Question Answering as Programming for Solving Time-Sensitive Questions
☆12 · Dec 18, 2023 · Updated 2 years ago
ChartMimic / ChartMimic
[ICLR 2025] ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation
☆131 · Dec 19, 2025 · Updated 2 months ago
gjq100 / Graph-Counselor
☆27 · Jun 5, 2025 · Updated 8 months ago
TianHongZXY / RLVR-Decomposed
[NeurIPS 2025] Implementation for the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning"
☆161 · Oct 28, 2025 · Updated 4 months ago
SihengLi99 / LLM-Honesty-Survey
[2025-TMLR] A Survey on the Honesty of Large Language Models
☆64 · Dec 8, 2024 · Updated last year
thunlp / AutoForm
Code for paper "Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication"
☆22 · Mar 30, 2024 · Updated last year
LHRYANG / FSD
Implementation of the LREC-COLING 2024 paper "A Frustratingly Simple Decoding Method for Neural Text Generation"
☆19 · Feb 23, 2024 · Updated 2 years ago
flageval-baai / HalluDial
☆21 · Aug 19, 2024 · Updated last year
TianHongZXY / CoRe
[ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced Language Models (LLMs + MCTS + Self-Improvement)
☆50 · Dec 15, 2023 · Updated 2 years ago
LLMSQL / llmsql-benchmark
A Text2SQL benchmark for evaluation of Large Language Models
☆41 · Updated this week
Yarayx / livelongbench
The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w…
☆12 · Jun 28, 2025 · Updated 8 months ago
DavidFanzz / SCMoE
☆29 · May 24, 2024 · Updated last year
eric-ai-lab / MMWorld
Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
☆28 · Jul 15, 2025 · Updated 7 months ago
justinlovelace / Diffusion-Guided-LM
☆29 · Oct 20, 2025 · Updated 4 months ago
wln20 / CSKV
[NeurIPS ENLSP Workshop'24] CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
☆16 · Oct 18, 2024 · Updated last year
marinero4972 / CyberV
☆18 · Jun 10, 2025 · Updated 8 months ago
Ch3nYe / AutoCompiler
☆48 · Sep 4, 2025 · Updated 5 months ago
OPPO-Mente-Lab / DaMo
The official implementation of the paper "DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents"
☆29 · Oct 23, 2025 · Updated 4 months ago
wzhan24 / UniMate
☆11 · Jun 22, 2025 · Updated 8 months ago
GradientHQ / symphony
Symphony: A decentralized multi-agent framework that enables intelligent agents to collaborate seamlessly across heterogeneous edge devi…
☆30 · Oct 30, 2025 · Updated 4 months ago
akhilkedia / TranformersGetStable
[ICML 2024] Official Repository for the paper "Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models"
☆10 · Jul 19, 2024 · Updated last year
furiosa-ai / ParallelBench
[ICLR 2026] ParallelBench: Understanding the Tradeoffs of Parallel Decoding in Diffusion LLMs
☆30 · Updated this week
sani903 / OpenAgentSafety
A Framework for Evaluating AI Agent Safety in Realistic Environments
☆30 · Oct 2, 2025 · Updated 4 months ago
PacktPublishing / Mastering-AI-Agents-for-Databases
☆12 · Dec 15, 2025 · Updated 2 months ago
ZhangXJ199 / EDGE-GRPO
Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
☆22 · Aug 28, 2025 · Updated 6 months ago
smallporridge / TrustworthyRAG
☆16 · Sep 17, 2024 · Updated last year
THU-KEG / PairJudgeRM
☆14 · Apr 14, 2025 · Updated 10 months ago
zhangzef / COOPER
The official implementation of COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence.
☆28 · Dec 30, 2025 · Updated 2 months ago
segev-shlomov / ST-WebAgentBench
A Benchmark for Evaluating Safety and Trustworthiness in Web Agents for Enterprise Scenarios
☆19 · Feb 22, 2026 · Updated last week
snumprlab / hima
Official Implementation of HIMA (COLM'25)
☆19 · Nov 25, 2025 · Updated 3 months ago
LianKee / SO-CVEs
☆10 · Jun 5, 2023 · Updated 2 years ago
ZJU-REAL / cooper
☆25 · Aug 19, 2025 · Updated 6 months ago
FredJiang0324 / MAMGA
☆24 · Jan 8, 2026 · Updated last month
LunarShen / DsicoVLA
[CVPR 2025] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
☆21 · Jun 23, 2025 · Updated 8 months ago
Longin-Yu / ComRoPE
☆12 · Jun 11, 2025 · Updated 8 months ago
Leosang-lx / FlowSpec
Continuous Pipelined Speculative Decoding
☆16 · Jan 4, 2026 · Updated last month
SEU-VIPGroup / Understanding_Vision_Tasks
☆13 · Feb 2, 2025 · Updated last year
Zcchill / Value-Residual-Learning
☆14 · Mar 20, 2025 · Updated 11 months ago
Princeton-AI2-Lab / ZoomClick
A Practical Zoom-in GUI Grounding and Behavior-Based Evaluation method.
☆19 · Dec 8, 2025 · Updated 2 months ago