BigComputer-Project / SWE-Arena
SWE Arena
☆31Updated last week
Alternatives and similar repositories for SWE-Arena:
Users who are interested in SWE-Arena are comparing it to the libraries listed below:
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | ACL 2024 SRW (Oral)☆59Updated 6 months ago
- ☆114Updated 2 months ago
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples☆84Updated 3 weeks ago
- ☆48Updated last week
- Replicating O1 inference-time scaling laws☆83Updated 4 months ago
- A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models☆47Updated last month
- A simple unified framework for evaluating LLMs☆209Updated last week
- ☆33Updated 3 weeks ago
- ☆70Updated 5 months ago
- ☆166Updated this week
- Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents☆61Updated 2 weeks ago
- Complex Function Calling Benchmark.☆96Updated 3 months ago
- Repository for the paper Stream of Search: Learning to Search in Language☆144Updated 2 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users☆220Updated 5 months ago
- RepoQA: Evaluating Long-Context Code Understanding☆107Updated 5 months ago
- Astraios: Parameter-Efficient Instruction Tuning Code Language Models☆57Updated last year
- ☆51Updated this week
- ☆60Updated this week
- Systematic evaluation framework that automatically rates overthinking behavior in large language models.☆86Updated 2 weeks ago
- Scalable Meta-Evaluation of LLMs as Evaluators☆42Updated last year
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)☆133Updated 5 months ago
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks☆177Updated last week
- Accompanying material for the sleep-time compute paper☆17Updated this week
- EvaByte: Efficient Byte-level Language Models at Scale☆87Updated last month
- The first dense retrieval model that can be prompted like an LM☆70Updated 7 months ago
- Code for Paper: Teaching Language Models to Critique via Reinforcement Learning☆92Updated last week
- The official evaluation suite and dynamic data release for MixEval.☆235Updated 5 months ago
- ☆57Updated last month
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.☆171Updated 3 months ago
- ☆60Updated 11 months ago