AlexCuadron / ThinkingAgent
Systematic evaluation framework that automatically rates overthinking behavior in large language models.
☆86 Updated 2 weeks ago
Alternatives and similar repositories for ThinkingAgent:
Users who are interested in ThinkingAgent are comparing it to the repositories listed below.
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- ☆70 Updated 5 months ago
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆84 Updated last month
- ☆114 Updated 2 months ago
- ☆83 Updated 2 months ago
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling ☆101 Updated 3 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆177 Updated last week
- ☆48 Updated last week
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆55 Updated 7 months ago
- Repo for "Z1: Efficient Test-time Scaling with Code" ☆53 Updated last week
- ☆37 Updated 2 months ago
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning ☆50 Updated 2 weeks ago
- ☆45 Updated last month
- RepoQA: Evaluating Long-Context Code Understanding ☆107 Updated 5 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆168 Updated last month
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆54 Updated 4 months ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆137 Updated 2 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆170 Updated 3 months ago
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples ☆84 Updated 3 weeks ago
- DSBench: How Far are Data Science Agents from Becoming Data Science Experts? ☆50 Updated 2 months ago
- Code for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning https://arxiv.org/pdf/2410.01044 ☆32 Updated 6 months ago
- Replicating O1 inference-time scaling laws ☆83 Updated 4 months ago
- Complex Function Calling Benchmark. ☆96 Updated 3 months ago
- SWE Arena ☆31 Updated last week
- Code for Paper: Teaching Language Models to Critique via Reinforcement Learning ☆92 Updated last week
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆61 Updated 2 weeks ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆70 Updated 3 weeks ago
- The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling ☆29 Updated last month
- Code for EMNLP 2024 paper "Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning" ☆53 Updated 6 months ago
- Agentic Knowledgeable Self-awareness ☆47 Updated last week