☆330 · Jun 19, 2024 · Updated last year
Alternatives and similar repositories for MLAgentBench
Users who are interested in MLAgentBench are comparing it to the libraries listed below.
- Official implementation of "DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning" in ICML'24 ☆226 · Dec 3, 2024 · Updated last year
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆1,329 · Updated this week
- AIDE: AI-Driven Exploration in the Space of Code. The machine Learning engineering agent that automates AI R&D. ☆1,140 · Feb 12, 2026 · Updated 2 weeks ago
- [ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data Science Experts? ☆106 · Aug 17, 2025 · Updated 6 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆241 · May 5, 2024 · Updated last year
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ☆3,187 · Feb 8, 2026 · Updated 3 weeks ago
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels … ☆286 · Aug 19, 2023 · Updated 2 years ago
- [EMNLP 2024 Findings] Benchmarking Language Model Agents for Data-Driven Science ☆34 · Oct 25, 2024 · Updated last year
- ☆56 · Sep 9, 2023 · Updated 2 years ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆112 · Dec 12, 2024 · Updated last year
- CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025) ☆73 · Jun 25, 2024 · Updated last year
- ☆87 · Dec 15, 2023 · Updated 2 years ago
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery ☆124 · Aug 26, 2025 · Updated 6 months ago
- SmartPlay is a benchmark for Large Language Models (LLMs). Uses a variety of games to test various important LLM capabilities as agents. … ☆146 · Apr 11, 2024 · Updated last year
- Self-Alignment with Principle-Following Reward Models ☆169 · Sep 18, 2025 · Updated 5 months ago
- [IJCAI 2024] Generate different roles for GPTs to form a collaborative entity for complex tasks. ☆1,466 · Sep 9, 2025 · Updated 5 months ago
- ☆67 · Mar 30, 2025 · Updated 11 months ago
- AgentTuning: Enabling Generalized Agent Abilities for LLMs ☆1,477 · Oct 31, 2023 · Updated 2 years ago
- [ICLR 2024] Lemur: Open Foundation Models for Language Agents ☆555 · Oct 28, 2023 · Updated 2 years ago
- ☆133 · Oct 16, 2025 · Updated 4 months ago
- [CoLM 24] Official Repository of MambaByte: Token-free Selective State Space Model ☆24 · Oct 12, 2024 · Updated last year
- FireAct: Toward Language Agent Fine-tuning ☆292 · Oct 22, 2023 · Updated 2 years ago
- Interactive coding assistant for data scientists and machine learning developers, empowered by large language models. ☆99 · Oct 8, 2024 · Updated last year
- [ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award … ☆42 · Oct 28, 2024 · Updated last year
- ☆188 · Jan 27, 2025 · Updated last year
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆133 · Jun 4, 2024 · Updated last year
- ☆2,882 · Feb 20, 2025 · Updated last year
- [AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing (https://arxiv.org/abs/2401.09003) ☆23 · Oct 2, 2025 · Updated 4 months ago
- AIDE: the Machine Learning CodeGen Agent ☆25 · Oct 7, 2024 · Updated last year
- Visual RAG using less than 300 lines of code. ☆30 · Mar 2, 2024 · Updated 2 years ago
- [NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents ☆488 · Sep 6, 2024 · Updated last year
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ☆1,353 · Nov 26, 2025 · Updated 3 months ago
- Open-source repository for the OOPSLA'24 paper "CYCLE: Learning to Self-Refine Code Generation" ☆10 · Mar 8, 2024 · Updated last year
- ☆11 · Mar 13, 2023 · Updated 2 years ago
- ☆11 · Jan 3, 2024 · Updated 2 years ago
- A library for advanced large language model reasoning ☆2,333 · Jun 10, 2025 · Updated 8 months ago
- [ICLR 2025] Automated Design of Agentic Systems ☆1,521 · Jan 28, 2025 · Updated last year
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆49 · Dec 22, 2023 · Updated 2 years ago
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆473 · Mar 19, 2024 · Updated last year