All-Hands-AI / trajectory-visualizer
☆27 Updated 3 weeks ago
Alternatives and similar repositories for trajectory-visualizer
Users who are interested in trajectory-visualizer are comparing it to the repositories listed below.
- Reasoning by Communicating with Agents ☆29 Updated last month
- Agent computer interface for AI software engineer. ☆85 Updated this week
- Small, simple agent task environments for training and evaluation ☆18 Updated 7 months ago
- accompanying material for sleep-time compute paper ☆95 Updated last month
- ☆41 Updated 4 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆57 Updated 9 months ago
- ☆50 Updated 3 weeks ago
- Run SWE-bench evaluations remotely ☆21 Updated last month
- 🔔🧠 Easily experiment with popular language agents across diverse reasoning/decision-making benchmarks! ☆52 Updated this week
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆58 Updated 6 months ago
- ☆13 Updated 3 months ago
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024) ☆37 Updated 5 months ago
- ☆52 Updated 2 weeks ago
- Computer Agent Arena: Test & compare AI agents in real desktop apps & web environments. Code/data coming soon! ☆45 Updated 2 months ago
- Harness used to benchmark aider against SWE Bench benchmarks ☆72 Updated last year
- Verifiers for LLM Reinforcement Learning ☆60 Updated 2 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆32 Updated 2 months ago
- ☆86 Updated 3 weeks ago
- Official repo for Learning to Reason for Long-Form Story Generation ☆63 Updated 2 months ago
- ☆41 Updated 2 weeks ago
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data ☆38 Updated 4 months ago
- Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding. ☆40 Updated 3 months ago
- Moatless Testbeds allows you to create isolated testbed environments in a Kubernetes cluster where you can apply code changes through git… ☆13 Updated 2 months ago
- ☆40 Updated 11 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 Updated last year
- Optimizing Causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna ☆53 Updated 4 months ago
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆77 Updated 2 weeks ago
- Landing page + leaderboard for SWE-Bench benchmark ☆6 Updated 2 weeks ago
- ☆24 Updated 9 months ago
- ☆65 Updated 2 months ago