snap-stanford/MLAgentBench

☆346

Alternatives and similar repositories for MLAgentBench

Users who are interested in MLAgentBench are comparing it to the repositories listed below.

guosyjlu / DS-Agent
Official implementation of "DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning" in ICML'24
☆238 · Dec 3, 2024 · Updated last year
WecoAI / aideml
AIDE: AI-Driven Exploration in the Space of Code. The machine Learning engineering agent that automates AI R&D.
☆1,431 · Updated this week
openai / mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆1,647 · Apr 24, 2026 · Updated 2 months ago
redwoodresearch / interp
Redwood Research's transformer interpretability tools
☆15 · Apr 15, 2022 · Updated 4 years ago
behavioral-data / BLADE
[EMNLP 2024 Findings] Benchmarking Language Model Agents for Data-Driven Science
☆35 · Oct 25, 2024 · Updated last year
LiqiangJing / DSBench
[ICLR 2025] DSBench: How Far are Data Science Agents from Becoming Data Science Experts?
☆125 · Aug 17, 2025 · Updated 11 months ago
OSU-NLP-Group / ScienceAgentBench
[ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
☆149 · Updated this week
THUDM / AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
☆3,586 · Feb 8, 2026 · Updated 5 months ago
night-chen / ToolQA
ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels …
☆286 · Aug 19, 2023 · Updated 2 years ago
princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆252 · May 5, 2024 · Updated 2 years ago
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆170 · Sep 18, 2025 · Updated 10 months ago
allenai / clin
☆89 · Dec 15, 2023 · Updated 2 years ago
jxiw / MambaByte
[CoLM 24] Official Repository of MambaByte: Token-free Selective State Space Model
☆27 · Oct 12, 2024 · Updated last year
OpenLemur / Lemur
[ICLR 2024] Lemur: Open Foundation Models for Language Agents
☆557 · Oct 28, 2023 · Updated 2 years ago
du-nlp-lab / MLR-Copilot
☆70 · Mar 30, 2025 · Updated last year
weaviate / how-to-ingest-pdfs-with-unstructured
☆19 · May 23, 2023 · Updated 3 years ago
THUDM / AgentTuning
AgentTuning: Enabling Generalized Agent Abilities for LLMs
☆1,500 · Oct 31, 2023 · Updated 2 years ago
xlang-ai / Spider2-V
[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
☆153 · Aug 26, 2024 · Updated last year
shoggoth13 / agents-deconstructed
☆55 · Sep 9, 2023 · Updated 2 years ago
thesofakillers / aideml
AIDE: the Machine Learning CodeGen Agent
☆25 · Oct 7, 2024 · Updated last year
microsoft / SmartPlay
SmartPlay is a benchmark for Large Language Models (LLMs). Uses a variety of games to test various important LLM capabilities as agents. …
☆146 · Apr 11, 2024 · Updated 2 years ago
Cadenza-Labs / sleeper-agents
☆15 · Jul 12, 2024 · Updated 2 years ago
martin-wey / CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
☆76 · Jun 25, 2024 · Updated 2 years ago
zorazrw / odex
[EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation
☆49 · Dec 22, 2023 · Updated 2 years ago
ZonglinY / MOOSE
[ACL 2024] <Large Language Models for Automated Open-domain Scientific Hypotheses Discovery>. It has also received the best poster award …
☆45 · Oct 28, 2024 · Updated last year
Link-AGI / AutoAgents
[IJCAI 2024] Generate different roles for GPTs to form a collaborative entity for complex tasks.
☆1,490 · Sep 9, 2025 · Updated 10 months ago
salesforce / BOLAA
☆192 · Jun 2, 2026 · Updated last month
allenai / lumos
Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs"
☆478 · Mar 19, 2024 · Updated 2 years ago
iiis-ai / IterativeQuestionComposing
[AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing (https://arxiv.org/abs/2401.09003)
☆23 · Oct 2, 2025 · Updated 9 months ago
haotiansun14 / AdaPlanner
AdaPlanner: Language Models for Decision Making via Adaptive Planning from Feedback
☆125 · Mar 31, 2025 · Updated last year
METR / RE-Bench
☆144 · Oct 16, 2025 · Updated 9 months ago
automix-llm / automix
Mixing Language Models with Self-Verification and Meta-Verification
☆116 · Dec 12, 2024 · Updated last year
Paitesanshi / LLM-Agent-Survey
☆2,908 · Feb 20, 2025 · Updated last year
weirayao / Retroformer
☆39 · May 2, 2024 · Updated 2 years ago
noahho / CAAFE
Semi-automatic feature engineering process using Language Models and your dataset descriptions. Based on the paper "LLMs for Semi-Automat…
☆195 · Dec 20, 2024 · Updated last year
anchen1011 / FireAct
FireAct: Toward Language Agent Fine-tuning
☆296 · Oct 22, 2023 · Updated 2 years ago
OSU-NLP-Group / awesome-agents4science
A curated list of papers on LLMs and agents for scientific research and development
☆96 · Dec 11, 2024 · Updated last year
MetaCopilot / dseval
☆33 · Jun 24, 2024 · Updated 2 years ago
MLE-Dojo / MLE-Dojo
☆99 · Oct 30, 2025 · Updated 8 months ago