AlexCuadron / ThinkingAgent
Systematic evaluation framework that automatically rates overthinking behavior in large language models.
☆90 Updated last month
Alternatives and similar repositories for ThinkingAgent
Users who are interested in ThinkingAgent are comparing it to the repositories listed below.
- Verifiers for LLM Reinforcement Learning ☆60 Updated 2 months ago
- 🚀 SWE-bench Goes Live! ☆80 Updated this week
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents ☆135 Updated last week
- ☆53 Updated last week
- Process Reward Models That Think ☆41 Updated 3 weeks ago
- ☆32 Updated last month
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆57 Updated 9 months ago
- RL Scaling and Test-Time Scaling (ICML'25) ☆106 Updated 5 months ago
- ☆115 Updated 4 months ago
- Open-Source LLM Coders with Co-Evolving Reinforcement Learning ☆83 Updated 2 weeks ago
- Official Repo for InSTA: Towards Internet-Scale Training For Agents ☆42 Updated this week
- Scaling Computer-Use Grounding via UI Decomposition and Synthesis ☆79 Updated last week
- ☆85 Updated 7 months ago
- ☆36 Updated 2 weeks ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆62 Updated last month
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning ☆57 Updated 2 months ago
- ☆47 Updated 3 weeks ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆76 Updated 2 weeks ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- ☆35 Updated 3 weeks ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆159 Updated 2 weeks ago
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆219 Updated last month
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆57 Updated 6 months ago
- DSBench: How Far are Data Science Agents from Becoming Data Science Experts? ☆55 Updated 4 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆173 Updated 5 months ago
- Code for Paper: Learning Adaptive Parallel Reasoning with Language Models ☆107 Updated 2 months ago
- Official implementation of the paper "Process Reward Model with Q-value Rankings" ☆59 Updated 4 months ago
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners ☆82 Updated last month
- [ACL 2025] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆93 Updated 2 weeks ago
- Official Code Repository for the paper "Distilling LLM Agent into Small Models with Retrieval and Code Tools" ☆104 Updated 2 weeks ago