MoonshotAI / K2-Vendor-Verifier
Verify the precision of all Kimi K2 API vendors
☆258 · Updated last week
Alternatives and similar repositories for K2-Vendor-Verifier
Users who are interested in K2-Vendor-Verifier are comparing it to the repositories listed below
- Coding problems used in aider's polyglot benchmark ☆183 · Updated 9 months ago
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? ☆196 · Updated this week
- LLMProc: Unix-inspired runtime that treats LLMs as processes. ☆33 · Updated 3 months ago
- ☆135 · Updated 5 months ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆331 · Updated 2 weeks ago
- ☆273 · Updated 4 months ago
- Prompt-to-Leaderboard ☆259 · Updated 5 months ago
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ☆230 · Updated 2 months ago
- Train your own SOTA deductive reasoning model ☆108 · Updated 7 months ago
- Train Large Language Models on MLX. ☆187 · Updated 3 weeks ago
- ☆441 · Updated last month
- Super basic implementation (gist-like) of RLMs with REPL environments. ☆132 · Updated this week
- Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" ☆161 · Updated 3 months ago
- Proof-of-concept of Cursor's Instant Apply feature ☆83 · Updated last year
- AI benchmark runtime framework that allows you to integrate and evaluate AI tasks using Docker-based benchmarks. ☆158 · Updated 5 months ago
- Distributed inference for MLX LLMs ☆97 · Updated last year
- ☆232 · Updated 3 months ago
- Pivotal Token Search ☆128 · Updated 3 months ago
- GRPO training code which scales to 32xH100s for long horizon terminal/coding tasks. Base agent is now the top Qwen3 agent on Stanford's T… ☆277 · Updated last month
- ☆162 · Updated 2 months ago
- ☆170 · Updated 7 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆429 · Updated last week
- ☆68 · Updated 4 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆768 · Updated last week
- llmbasedos: Local-First OS Where Your AI Agents Wake Up and Work ☆275 · Updated 2 months ago
- The DPAB-α Benchmark ☆32 · Updated 9 months ago
- Run AI generated code in isolated sandboxes ☆112 · Updated 8 months ago
- Claude Deep Research config for Claude Code. ☆221 · Updated 7 months ago
- ☆93 · Updated 3 months ago
- j1-micro (1.7B) & j1-nano (600M) are absurdly tiny but mighty reward models. ☆98 · Updated 3 months ago