☆331Jun 19, 2024Updated last year
Alternatives and similar repositories for MLAgentBench
Users who are interested in MLAgentBench are comparing it to the libraries listed below.
- Official implementation of "DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning" in ICML'24☆229Dec 3, 2024Updated last year
- [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data Science Experts?☆111Aug 17, 2025Updated 7 months ago
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering☆1,381Updated this week
- AIDE: AI-Driven Exploration in the Space of Code. The machine learning engineering agent that automates AI R&D.☆1,162Feb 12, 2026Updated last month
- Redwood Research's transformer interpretability tools☆15Apr 15, 2022Updated 3 years ago
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)☆3,253Feb 8, 2026Updated last month
- [EMNLP 2024 Findings] Benchmarking Language Model Agents for Data-Driven Science☆35Oct 25, 2024Updated last year
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels …☆285Aug 19, 2023Updated 2 years ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery☆132Mar 5, 2026Updated 2 weeks ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898☆243May 5, 2024Updated last year
- AIDE: the Machine Learning CodeGen Agent☆25Oct 7, 2024Updated last year
- Self-Alignment with Principle-Following Reward Models☆170Sep 18, 2025Updated 6 months ago
- ☆134Oct 16, 2025Updated 5 months ago
- ☆88Dec 15, 2023Updated 2 years ago
- [CoLM 24] Official Repository of MambaByte: Token-free Selective State Space Model☆24Oct 12, 2024Updated last year
- ☆67Mar 30, 2025Updated 11 months ago
- SmartPlay is a benchmark for Large Language Models (LLMs). Uses a variety of games to test various important LLM capabilities as agents. …☆147Apr 11, 2024Updated last year
- AgentTuning: Enabling Generalized Agent Abilities for LLMs☆1,483Oct 31, 2023Updated 2 years ago
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?☆150Aug 26, 2024Updated last year
- [ICLR 2024] Lemur: Open Foundation Models for Language Agents☆557Oct 28, 2023Updated 2 years ago
- ☆19May 23, 2023Updated 2 years ago
- ☆13Jul 12, 2024Updated last year
- [ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award …☆43Oct 28, 2024Updated last year
- [IJCAI 2024] Generate different roles for GPTs to form a collaborative entity for complex tasks.☆1,471Sep 9, 2025Updated 6 months ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation☆49Dec 22, 2023Updated 2 years ago
- CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)☆73Jun 25, 2024Updated last year
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs"☆475Mar 19, 2024Updated 2 years ago
- ☆189Jan 27, 2025Updated last year
- AdaPlanner: Language Models for Decision Making via Adaptive Planning from Feedback☆125Mar 31, 2025Updated 11 months ago
- [AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing (https://arxiv.org/abs/2401.09003)☆23Oct 2, 2025Updated 5 months ago
- ☆2,890Feb 20, 2025Updated last year
- Mixing Language Models with Self-Verification and Meta-Verification☆112Dec 12, 2024Updated last year
- A curated list of papers on LLMs and agents for scientific research and development☆86Dec 11, 2024Updated last year
- [NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents☆507Sep 6, 2024Updated last year
- ☆39May 2, 2024Updated last year
- FireAct: Toward Language Agent Fine-tuning☆292Oct 22, 2023Updated 2 years ago
- ☆285Dec 4, 2024Updated last year
- [ICLR 2025] Automated Design of Agentic Systems☆1,540Jan 28, 2025Updated last year
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha…☆133Jun 4, 2024Updated last year