tensorzero / llmgym
☆29 · Updated last month
Alternatives and similar repositories for llmgym
Users interested in llmgym are comparing it to the libraries listed below.
- Curated collection of community environments ☆200 · Updated this week
- ☆235 · Updated last week
- ☆116 · Updated last week
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat… ☆411 · Updated last month
- Library for text-to-text regression, applicable to any input string representation and allows pretraining and fine-tuning over multiple r… ☆305 · Updated 3 weeks ago
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆334 · Updated 2 months ago
- Storing long contexts in tiny caches with self-study ☆229 · Updated last month
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆189 · Updated 10 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆402 · Updated last week
- Training API and CLI ☆305 · Updated 3 weeks ago
- ☆213 · Updated 2 weeks ago
- Async RL Training at Scale ☆985 · Updated this week
- rl from zero pretrain, can it be done? yes. ☆286 · Updated 3 months ago
- PyTorch-native post-training at scale ☆585 · Updated this week
- GRPO training code which scales to 32xH100s for long-horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's T… ☆323 · Updated 4 months ago
- An interface library for RL post-training with environments. ☆973 · Updated this week
- ☆32 · Updated 7 months ago
- ☆59 · Updated 11 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆174 · Updated 9 months ago
- ☆67 · Updated 6 months ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆128 · Updated 2 months ago
- Collection of evals for Inspect AI ☆332 · Updated this week
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆302 · Updated 3 weeks ago
- ⚖️ Awesome LLM Judges ⚖️ ☆148 · Updated 8 months ago
- Inference-time scaling for LLMs-as-a-judge. ☆320 · Updated 2 months ago
- Harbor is a framework for running agent evaluations and for creating and using RL environments. ☆306 · Updated this week
- ☆127 · Updated 2 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆509 · Updated this week
- A framework for optimizing DSPy programs with RL ☆303 · Updated this week
- ☆113 · Updated 3 months ago