lmarena / p2l
Prompt-to-Leaderboard
☆218 Updated last week
Alternatives and similar repositories for p2l:
Users who are interested in p2l are comparing it to the libraries listed below.
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆177 Updated last week
- Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution" ☆503 Updated last month
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆171 Updated 3 months ago
- A simple unified framework for evaluating LLMs ☆209 Updated last week
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆227 Updated last month
- Systematic evaluation framework that automatically rates overthinking behavior in large language models. ☆86 Updated 2 weeks ago
- AWM: Agent Workflow Memory ☆262 Updated 2 months ago
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction ☆280 Updated last month
- ☆155 Updated 7 months ago
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ☆363 Updated 2 weeks ago
- ☆48 Updated last week
- Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" ☆199 Updated last month
- CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction ☆500 Updated 2 months ago
- ☆297 Updated 4 months ago
- ☆542 Updated 2 weeks ago
- ☆119 Updated 8 months ago
- Official code repository for Sketch-of-Thought (SoT) ☆107 Updated 3 weeks ago
- Official implementation of paper "On the Diagram of Thought" (https://arxiv.org/abs/2409.10038) ☆178 Updated 3 weeks ago
- Code for Husky, an open-source language agent that solves complex, multi-step reasoning tasks. Husky v1 addresses numerical, tabular and … ☆341 Updated 10 months ago
- [ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents ☆199 Updated last week
- Multi-Faceted AI Agent and Workflow Autotuning. Automatically optimizes LangChain, LangGraph, DSPy programs for better quality, lower exe… ☆222 Updated 3 weeks ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆137 Updated 2 months ago
- ☆698 Updated this week
- ☆145 Updated last month
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning ☆50 Updated 2 weeks ago
- OS-ATLAS: A Foundation Action Model For Generalist GUI Agents ☆322 Updated this week
- ☆194 Updated last month
- ☆122 Updated last month
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆186 Updated 9 months ago
- II-Researcher: a new open-source framework designed to aid building search / research agents ☆238 Updated last week