xbench-ai / xbench-evals
Evergreen, contamination-free, real-world, domain-specific AI evaluation framework
☆86Updated 2 months ago
Alternatives and similar repositories for xbench-evals
Users who are interested in xbench-evals are comparing it to the repositories listed below.
- ☆102Updated 4 months ago
- SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis☆103Updated 3 months ago
- AutoCoA (Automatic generation of Chain-of-Action) is an agent model framework that enhances the multi-turn tool usage capability of reaso…☆125Updated 5 months ago
- ☆159Updated 7 months ago
- ☆73Updated 7 months ago
- ☆98Updated last year
- Open Source Implementation of Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evo…☆83Updated last month
- Hammer: Robust Function-Calling for On-Device Language Models via Function Masking☆100Updated 3 months ago
- Scaling Preference Data Curation via Human-AI Synergy☆106Updated 2 months ago
- ☆96Updated 9 months ago
- ☆109Updated 3 weeks ago
- ☆147Updated last year
- Deep Research Agent CognitiveKernel-Pro from Tencent AI Lab. Paper: https://arxiv.org/pdf/2508.00414☆336Updated 3 weeks ago
- IKEA: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent☆64Updated 4 months ago
- ☆62Updated this week
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)☆92Updated 6 months ago
- ☆63Updated 4 months ago
- WritingBench: A Comprehensive Benchmark for Generative Writing☆115Updated last week
- ☆330Updated 3 months ago
- a-m-team's exploration in large language modeling☆187Updated 3 months ago
- Awesome Deep Research list! For more details, please refer to our survey paper -- A Comprehensive Survey of Deep Research: Systems, Metho…☆316Updated 2 weeks ago
- Scaling Deep Research via Reinforcement Learning in Real-world Environments.☆589Updated 5 months ago
- ☆163Updated 4 months ago
- ☆89Updated 4 months ago
- Build, manage, and scale your AI agents with ease.☆451Updated last week
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios☆69Updated 4 months ago
- ☆73Updated 3 months ago
- Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (…☆330Updated last week
- [ICLR 2025] The official implementation of paper "ToolGen: Unified Tool Retrieval and Calling via Generation"☆158Updated 5 months ago
- A visualization tool for deeper understanding and easier debugging of RLHF training.☆250Updated 6 months ago