ServiceNow / insight-bench
☆39Updated 2 months ago
Alternatives and similar repositories for insight-bench:
Users who are interested in insight-bench are comparing it to the libraries listed below
- A benchmark list for evaluation of large language models.☆99Updated last month
- Public code repo for paper "SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales"☆104Updated 6 months ago
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples☆84Updated last month
- Code for MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents https://www.arxiv.org/pdf/2503.01935☆96Updated last month
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery☆81Updated 2 weeks ago
- Code for paper Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding☆63Updated 10 months ago
- DSBench: How Far are Data Science Agents from Becoming Data Science Experts?☆50Updated 2 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆103Updated last year
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"☆54Updated last year
- augmented LLM with self reflection☆119Updated last year
- Codes and datasets for the paper Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Ref…☆48Updated last month
- Data and code for the Corr2Cause paper (ICLR 2024)☆96Updated last year
- ☆36Updated 3 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators☆42Updated last year
- ☆107Updated 3 months ago
- ☆70Updated 5 months ago
- ☆125Updated this week
- ☆37Updated 7 months ago
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval☆99Updated last week
- Codebase accompanying the Summary of a Haystack paper.☆77Updated 7 months ago
- DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems☆30Updated 6 months ago
- ☆221Updated 8 months ago
- ☆22Updated 10 months ago
- This repo contains code for paper: "Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach".☆16Updated 6 months ago
- 🤝 The code for "Can Large Language Model Agents Simulate Human Trust Behaviors?"☆77Updated 2 weeks ago
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024)☆36Updated 3 months ago
- Systematic evaluation framework that automatically rates overthinking behavior in large language models.☆86Updated 2 weeks ago
- ☆64Updated this week
- [NeurIPS 2024] Agent Planning with World Knowledge Model☆126Updated 4 months ago
- ☆120Updated 6 months ago