zlwang-cs / OfficeBench
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
☆12 Updated 2 months ago
Alternatives and similar repositories for OfficeBench
Users who are interested in OfficeBench are comparing it to the repositories listed below.
- Implementation of AdaCQR (COLING 2025) ☆10 Updated 4 months ago
- Code for "Reasoning to Learn from Latent Thoughts" ☆94 Updated last month
- AdaRFT: Efficient Reinforcement Finetuning via Adaptive Curriculum Learning ☆34 Updated last week
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆179 Updated 2 months ago
- GenRM-CoT: Data release for verification rationales ☆60 Updated 7 months ago
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement (EMNLP 2024 Main Conference) ☆57 Updated 7 months ago
- The rule-based evaluation subset and code implementation of Omni-MATH ☆21 Updated 4 months ago
- ☆45 Updated last month
- ☆59 Updated 8 months ago
- [ICLR2025 Spotlight] Agent Trajectory Synthesis via Guiding Replay with Web Tutorials ☆31 Updated 2 months ago
- Repo of paper "Free Process Rewards without Process Labels" ☆147 Updated 2 months ago
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy ☆61 Updated 5 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆120 Updated 8 months ago
- Code for Paper: Learning Adaptive Parallel Reasoning with Language Models ☆81 Updated 3 weeks ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆95 Updated last week
- An Illusion of Progress? Assessing the Current State of Web Agents ☆45 Updated this week
- ☆165 Updated last month
- ☆13 Updated 10 months ago
- General Reasoner: Advancing LLM Reasoning Across All Domains ☆82 Updated last week
- ☆151 Updated 5 months ago
- ☆18 Updated 2 weeks ago
- ☆44 Updated 9 months ago
- ☆61 Updated last month
- ☆34 Updated this week
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆67 Updated 3 weeks ago
- ☆36 Updated last month
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆135 Updated 5 months ago
- A new dataset of difficult graduate-level applied mathematics problems; evaluations demonstrate that leading LLMs currently exhibit low a… ☆17 Updated 3 months ago
- Collections of RLxLM experiments using minimal codes ☆12 Updated 3 months ago
- Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold" ☆30 Updated 11 months ago