THUDM / DataSciBench
DataSciBench: An LLM Agent Benchmark for Data Science
☆48 Updated 4 months ago
Alternatives and similar repositories for DataSciBench
Users interested in DataSciBench are comparing it to the repositories listed below.
- [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data Science Experts? ☆98 Updated 4 months ago
- ☆46 Updated 7 months ago
- ☆53 Updated 10 months ago
- official implementation of paper "Process Reward Model with Q-value Rankings" ☆65 Updated 11 months ago
- ☆31 Updated last year
- [NAACL 2025] The official implementation of paper "Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language M… ☆29 Updated last year
- ☆20 Updated 4 months ago
- RL Scaling and Test-Time Scaling (ICML'25) ☆112 Updated 11 months ago
- ☆88 Updated 2 months ago
- [ACL 2025] Are Your LLMs Capable of Stable Reasoning? ☆32 Updated 5 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆43 Updated last year
- [ICML 2025] Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples ☆113 Updated 5 months ago
- ☆24 Updated 9 months ago
- Process Reward Models That Think ☆70 Updated last month
- [ICML 2025] Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment (https://arxiv.org/abs/2410.02197) ☆38 Updated 4 months ago
- Source code of paper: Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning ☆45 Updated 6 months ago
- Evaluate the Quality of Critique ☆36 Updated last year
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆118 Updated 4 months ago
- Open source code of the paper: "OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain" ☆80 Updated last year
- WideSearch: Benchmarking Agentic Broad Info-Seeking ☆110 Updated 3 months ago
- ☆110 Updated 8 months ago
- ☆50 Updated 10 months ago
- [EMNLP 2025] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning ☆64 Updated 2 months ago
- Interpretable Contrastive Monte Carlo Tree Search Reasoning ☆48 Updated last year
- ☆104 Updated last year
- Codebase for Instruction Following without Instruction Tuning ☆36 Updated last year
- SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning (NeurIPS D&B Track 2024) ☆86 Updated last year
- Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner ☆29 Updated last year
- [EMNLP 2024] RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning ☆14 Updated 7 months ago
- [ICLR'24 spotlight] Tool-Augmented Reward Modeling ☆52 Updated 7 months ago