ruleGreen / AppBench

This is the repository for the EMNLP 2024 paper "AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction"

☆13

Alternatives and similar repositories for AppBench

Users who are interested in AppBench are comparing it to the repositories listed below:

WeiminXiong / IPR
Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement (EMNLP 2024 Main Conference)
☆57 Updated 7 months ago
rookie-joe / AutoPSV
☆46 Updated 7 months ago
ChengpengLi1003 / DotaMath
☆29 Updated 5 months ago
jinzhuoran / RAG-RewardBench
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
☆16 Updated 5 months ago
icip-cas / Verifier-Engineering
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
☆59 Updated 6 months ago
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆22 Updated 5 months ago
Reason-Wang / NAT
[NAACL 2025] The official implementation of paper "Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language M…
☆26 Updated last year
iQua / llmpebase
This is a unified platform for implementing and evaluating test-time reasoning mechanisms in Large Language Models (LLMs).
☆18 Updated 4 months ago
MingyuJ666 / The-Impact-of-Reasoning-Step-Length-on-Large-Language-Models
[ACL'24] Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models (LLMs). However, the correla…
☆46 Updated 3 weeks ago
YiCheng98 / IntegrativeDecoding
Official Implementation for the paper "Integrative Decoding: Improving Factuality via Implicit Self-consistency"
☆26 Updated last month
GAIR-NLP / MetaCritique
Evaluate the Quality of Critique
☆35 Updated last year
starrYYxuan / LeCo
This is the implementation of LeCo
☆31 Updated 4 months ago
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆61 Updated 5 months ago
THUNLP-MT / SKR
Self-Knowledge Guided Retrieval Augmentation for Large Language Models (EMNLP Findings 2023)
☆26 Updated last year
halfrot / ALaRM
[ACL 2024] Code for the paper "ALaRM: Align Language Models via Hierarchical Rewards Modeling"
☆25 Updated last year
RUCAIBox / RLMEC
The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"
☆38 Updated last year
zhaochen0110 / Cotempqa
Code and data for "Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?" (ACL 2024)
☆32 Updated 11 months ago
HKUNLP / critic-rl
[ICML 2025] Teaching Language Models to Critique via Reinforcement Learning
☆98 Updated last month
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆50 Updated 5 months ago
PremiLab-Math / MathCheck
[ICLR 2025] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
☆32 Updated 7 months ago
ToolBeHonest / ToolBeHonest
[EMNLP 2024] A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models.
☆17 Updated 8 months ago
mathllm / Step-Controlled_DPO
☆22 Updated 11 months ago
OSU-NLP-Group / llm-planning-eval
[ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"
☆54 Updated last year
bobxwu / learning-from-rewards-llm-papers
This repository collects research papers on learning from rewards in the context of post-training and test-time scaling of large language…
☆37 Updated 3 weeks ago
yyDing1 / ScaleQuest
[ACL-25] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.
☆63 Updated 7 months ago
GAIR-NLP / weak-to-strong-reasoning
☆59 Updated 9 months ago
qtli / GSM-Plus
GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems.
☆62 Updated 10 months ago
hkust-nlp / GUIMid
☆18 Updated last month
hanxuhu / SeqIns
The repository of the project "Fine-tuning Large Language Models with Sequential Instructions", code base comes from open-instruct and LA…
☆29 Updated 6 months ago
RUCKBReasoning / CoT-based-Synthesizer
Official code implementation for the ACL 2025 paper: 'CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis'
☆27 Updated 2 weeks ago