lmarena / copilot-arena
★352 Updated 2 months ago
Alternatives and similar repositories for copilot-arena
Users who are interested in copilot-arena are comparing it to the libraries listed below.
- Coding problems used in aider's polyglot benchmark ★199 Updated last year
- DevQualityEval: An evaluation benchmark and framework to compare and evolve the quality of code generation of LLMs. ★185 Updated 8 months ago
- Together Open Deep Research ★358 Updated 9 months ago
- Verify Precision of all Kimi K2 API Vendor ★507 Updated 2 weeks ago
- A comprehensive set of LLM benchmark scores and provider prices. (deprecated, read more in README) ★359 Updated 3 months ago
- A system that tries to resolve all issues on a github repo with OpenHands. ★117 Updated last year
- Agent computer interface for AI software engineer. ★116 Updated 2 months ago
- ★135 Updated 9 months ago
- Open-source resources on agents for computer use. ★398 Updated 4 months ago
- Building open version of OpenAI o1 via reasoning traces (Groq, ollama, Anthropic, Gemini, OpenAI, Azure supported) Demo: https://hugging… ★188 Updated last year
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ★430 Updated last week
- Public repository containing METR's DVC pipeline for eval data analysis ★199 Updated last week
- ★434 Updated last year
- Finetune Llama-3-8b on the MathInstruct dataset ★115 Updated last year
- multi1: create o1-like reasoning chains with multiple AI providers (and locally). Supports LiteLLM as backend too for 100+ providers at o… ★350 Updated last year
- GRadient-INformed MoE ★264 Updated last year
- Contains the prompts we use to talk to various LLMs for different utilities inside the editor ★84 Updated 2 years ago
- ★190 Updated last year
- Harness used to benchmark aider against SWE Bench benchmarks ★79 Updated last year
- Prompt-to-Leaderboard ★271 Updated 9 months ago
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ★243 Updated 6 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ★538 Updated last week
- ★109 Updated last year
- Letting Claude Code develop his own MCP tools :) ★123 Updated 11 months ago
- Testing baseline LLMs performance across various models ★336 Updated last week
- ★159 Updated 9 months ago
- Qwen 2.5 Coder 1.5B with Code Interpreter ★288 Updated last year
- ★80 Updated 4 months ago
- Routing on Random Forest (RoRF) ★239 Updated last year
- proof-of-concept of Cursor's Instant Apply feature ★88 Updated last year