LiveBench / liveswebench
☆61 · Updated 10 months ago
Alternatives and similar repositories for liveswebench
Users interested in liveswebench are comparing it to the repositories listed below.
- ☆132 · Updated 8 months ago
- ☆131 · Updated 9 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆245 · Updated last year
- [ACL'25 Findings] SWE-Dev is an SWE agent with a scalable test case construction pipeline. ☆58 · Updated 6 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆538 · Updated this week
- A simple unified framework for evaluating LLMs ☆261 · Updated 9 months ago
- RepoQA: Evaluating Long-Context Code Understanding ☆128 · Updated last year
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆225 · Updated 7 months ago
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆230 · Updated 6 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆146 · Updated last year
- A Comprehensive Benchmark for Software Development. ☆127 · Updated last year
- ☆80 · Updated 10 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆430 · Updated this week
- [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! ☆161 · Updated last week
- ☆313 · Updated last year
- ☆56 · Updated last year
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings ☆65 · Updated last year
- Reproducible, flexible LLM evaluations ☆337 · Updated last week
- The code for the paper "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models" ☆56 · Updated 3 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆335 · Updated 2 months ago
- Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task. ☆246 · Updated last week
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR 2025] ☆110 · Updated 11 months ago
- [ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (As Huggingface Daily Papers: … ☆90 · Updated 2 months ago
- [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI ☆475 · Updated last month
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆136 · Updated last year
- Complex Function Calling Benchmark. ☆165 · Updated last year
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach. ☆238 · Updated 5 months ago
- ☆41 · Updated 10 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆255 · Updated last year
- WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting. ☆62 · Updated last month