OpenGenerativeAI / llm-colosseum
Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
★1,332 · Updated this week
Related projects
Alternatives and complementary repositories for llm-colosseum
- MLE-Agent: Your intelligent companion for seamless AI engineering and research. Integrate with arxiv and paper with code to provide… ★1,098 · Updated this week
- ★2,754 · Updated 2 months ago
- Llama-3 agents that can browse the web by following instructions and talking to you ★1,352 · Updated 4 months ago
- Streamline the fine-tuning process for multimodal models: PaliGemma, Florence-2, and Qwen2-VL ★1,392 · Updated this week
- Automated Design of Agentic Systems ★1,040 · Updated this week
- Windows Agent Arena (WAA) is a scalable OS platform for testing and benchmarking of multi-modal AI agents. ★484 · Updated this week
- Agent S: an open agentic framework that uses computers like a human ★616 · Updated this week
- ★1,281 · Updated this week
- Run PyTorch LLMs locally on servers, desktop and mobile ★3,393 · Updated this week
- Mixture of Agents using Groq ★922 · Updated 3 months ago
- The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Join our Community: https://discord.com/servers/agora-999382051… ★1,791 · Updated this week
- A framework for serving and evaluating LLM routers - save LLM costs without compromising quality! ★3,267 · Updated 3 months ago
- Fine-tune LLM agents with online reinforcement learning ★996 · Updated 8 months ago
- We introduced a new model designed for the Code generation task. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 … ★814 · Updated 4 months ago
- [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling ★1,530 · Updated 4 months ago
- AIDE: the state-of-the-art machine learning engineer agent, generating machine learning solution code from natural language descriptions. ★601 · Updated this week
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments ★1,404 · Updated last week
- AIOS: LLM Agent Operating System ★3,429 · Updated this week
- Reaching LLaMA2 Performance with 0.1M Dollars ★961 · Updated 4 months ago
- An out-of-the-box (OOTB) version of Anthropic Claude Computer Use for Windows and macOS ★629 · Updated this week
- Training LLMs with QLoRA + FSDP ★1,420 · Updated 2 weeks ago
- ★940 · Updated 2 weeks ago
- Deploy your agentic workflows to production ★1,839 · Updated this week
- Agent driven automation starting with the web. Try it: https://www.emergence.ai/web-automation-api ★837 · Updated this week
- nanoGPT style version of Llama 3.1 ★1,248 · Updated 3 months ago
- Flexible and powerful framework for managing multiple AI agents and handling complex conversations ★1,686 · Updated this week
- Together Mixture-Of-Agents (MoA): 65.1% on AlpacaEval with OSS models ★2,600 · Updated last month
- Desktop app for prototyping and debugging LangGraph applications locally. ★1,948 · Updated this week
- Implementation of the training framework proposed in Self-Rewarding Language Model, from MetaAI ★1,338 · Updated 7 months ago
- [ICLR 2024] SWE-bench: Can Language Models Resolve Real-world Github Issues? ★2,014 · Updated this week