Evergreen, contamination-free, real-world, domain-specific AI evaluation framework
☆127Jan 11, 2026Updated last month
Alternatives and similar repositories for xbench-evals
Users who are interested in xbench-evals are comparing it to the repositories listed below.
- ☆144May 14, 2025Updated 9 months ago
- ☆20Updated this week
- Short RL☆18May 26, 2025Updated 9 months ago
- 🌟Official code of our AAAI26 paper 🔍WebFilter☆38Nov 9, 2025Updated 4 months ago
- a benchmark for evaluating logical reasoning of LLMs☆23Jan 25, 2024Updated 2 years ago
- Official Implementation of Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution☆71Dec 8, 2025Updated 3 months ago
- Circa (meaning ‘approximately’) dataset aims to help machine learning systems to solve the problem of interpreting indirect answers to po…☆20Oct 8, 2020Updated 5 years ago
- Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding☆28Jan 28, 2024Updated 2 years ago
- [arxiv: 2512.19673] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies☆61Feb 6, 2026Updated last month
- [R]einforcement [L]earning from [M]odel-rewarded [T]hinking - code for the paper "Language Models That Think, Chat Better"☆124Oct 27, 2025Updated 4 months ago
- ☆74May 30, 2025Updated 9 months ago
- An implementation of ROUGE family metrics for automatic summarization.☆24Jan 7, 2023Updated 3 years ago
- A Conversational Information Seeking (CIS) Paper Reading List Maintained by Chuan Meng.☆29Sep 27, 2022Updated 3 years ago
- Scaling Deep Research via Reinforcement Learning in Real-world Environments.☆709Oct 15, 2025Updated 4 months ago
- A collection of research papers on low-precision training methods☆64May 10, 2025Updated 10 months ago
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents☆614Updated this week
- homework in SCUT_SE☆12Nov 9, 2021Updated 4 years ago
- ☆10Dec 8, 2022Updated 3 years ago
- Benchmark and research code for the paper SWEET-RL Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks☆263May 5, 2025Updated 10 months ago
- 🔍 Awesome Agentic Search is a curated list of papers, tools, and resources on agentic search—where AI agents plan, search, and reason to…☆54Aug 28, 2025Updated 6 months ago
- ☆72Jun 10, 2025Updated 9 months ago
- Bilibili web crawler☆15Dec 10, 2023Updated 2 years ago
- ☆13Nov 5, 2024Updated last year
- Open-source 2021 radar station vision code from the Evolution team at Guilin University of Electronic Technology☆11Sep 3, 2021Updated 4 years ago
- Logiqa2.0 dataset - logical reasoning in MRC and NLI tasks☆102Aug 11, 2023Updated 2 years ago
- The official implementation of "LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented…☆50Apr 12, 2025Updated 10 months ago
- [Findings of ACL'2023] Improving Contrastive Learning of Sentence Embeddings from AI Feedback☆40Aug 14, 2023Updated 2 years ago
- Implementation for OAgents: An Empirical Study of Building Effective Agents☆310Oct 13, 2025Updated 4 months ago
- Demo code for the paper "Discrete Optimization for Shape Matching"☆12Jun 25, 2021Updated 4 years ago
- LOLA: Large and Open Source Multilingual Language Model☆11Jan 22, 2026Updated last month
- Regularized latent variable mixed membership modeling☆13Aug 12, 2013Updated 12 years ago
- allowing R users to work with dlib through Rcpp☆13Apr 11, 2018Updated 7 years ago
- FamilyTool benchmark☆12Sep 10, 2025Updated 6 months ago
- Testing sets for semanticVAD☆20Feb 18, 2025Updated last year
- ☆26Jul 29, 2025Updated 7 months ago
- Implementation for EACL 2024 paper "Corpus-Steered Query Expansion with Large Language Models"☆12Mar 19, 2024Updated last year
- Please visit https://github.com/HKUSTDial/NL2SQL360 to get the official code!☆10Sep 1, 2024Updated last year
- ☆15Dec 2, 2025Updated 3 months ago
- From barbarism to civilization requires a century; from civilization to barbarism needs but a day.☆12Jul 17, 2023Updated 2 years ago