THU-KEG / AgentIF
[NIPS 2025 DB Spotlight] AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
☆23Updated last month
Alternatives and similar repositories for AgentIF
Users who are interested in AgentIF are comparing it to the libraries listed below
- The demo, code and data of FollowRAG☆75Updated 6 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA☆144Updated 2 weeks ago
- [ICLR 2025] This is the code repo for our ICLR’25 paper "RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rew…☆50Updated 11 months ago
- [COLM'25] Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?☆36Updated 7 months ago
- ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation☆55Updated 2 months ago
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent☆147Updated last month
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"☆181Updated 7 months ago
- NeurIPS 2025: Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs☆63Updated last month
- ☆37Updated last month
- [EMNLP 2024] The official GitHub repo for the survey paper "Knowledge Conflicts for LLMs: A Survey"☆150Updated last year
- Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs (ACL 2024)☆73Updated 8 months ago
- [ICLR 2025] BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval☆182Updated 3 months ago
- SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis☆115Updated 7 months ago
- This is the code repo for the paper "RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards".☆23Updated last year
- The official repo for our paper: LegalAgentBench: Evaluating LLM Agents in Legal Domain☆36Updated last year
- ☆24Updated 2 years ago
- Source code of DRAGIN, ACL 2024 main conference Long Paper (Oral)☆182Updated last month
- [NeurIPS 2025] Implementation for the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning"☆146Updated 2 months ago
- Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"☆117Updated last month
- Official Implementation of "Probing Language Models for Pre-training Data Detection"☆20Updated last year
- [CIKM 2025] Constraint Back-translation Improves Complex Instruction Following of Large Language Models☆17Updated 7 months ago
- The code and data of DPA-RAG, accepted by WWW 2025 main conference.☆63Updated 2 months ago
- Official repository for RAG-Gym☆117Updated 10 months ago
- [EMNLP 2025] LightThinker: Thinking Step-by-Step Compression☆127Updated 8 months ago
- CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation☆65Updated 7 months ago
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts"☆82Updated last year
- Open source code of the paper: "OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain"☆80Updated last year
- This is the code of MMOA-RAG.☆98Updated 8 months ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations☆142Updated last month
- [EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing".☆83Updated 11 months ago