microsoft / Magma
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
☆1,638 Updated this week
Alternatives and similar repositories for Magma:
Users who are interested in Magma are comparing it to the repositories listed below.
- Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆1,010 Updated this week
- [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use. ☆1,227 Updated last month
- ☆870 Updated last month
- Code release for "LLMs can see and hear without any training" ☆432 Updated this week
- LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning ☆1,980 Updated 3 weeks ago
- OctoTools: An agentic framework with extensible tools for complex reasoning ☆1,130 Updated this week
- Official Implementation of "KBLaM: Knowledge Base augmented Language Model" ☆1,286 Updated last week
- Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents. ☆691 Updated last week
- An open-sourced end-to-end VLM-based GUI Agent ☆936 Updated last month
- VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and clou… ☆3,218 Updated this week
- Frontier Multimodal Foundation Models for Image and Video Understanding ☆782 Updated 3 weeks ago
- The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention ☆2,586 Updated last month
- State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! ☆943 Updated last week
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ☆829 Updated 3 weeks ago
- Witness the aha moment of VLM with less than $3. ☆3,642 Updated 2 months ago
- Democratizing Reinforcement Learning for LLMs ☆3,210 Updated last month
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,771 Updated last month
- Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL ☆2,196 Updated this week
- Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe… ☆2,867 Updated last week
- An Open Large Reasoning Model for Real-World Solutions ☆1,488 Updated 2 months ago
- ☆5,748 Updated this week
- SpatialLM: Large Language Model for Spatial Understanding ☆3,154 Updated last month
- PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation ☆1,762 Updated this week
- Releases from OpenAI Preparedness ☆729 Updated last month
- Agent S: an open agentic framework that uses computers like a human ☆4,558 Updated this week
- [ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction ☆289 Updated 2 months ago
- A live stream development of RL tuning for LLM agents ☆2,665 Updated this week
- Code for the Molmo Vision-Language Model ☆413 Updated 4 months ago
- Codebase for Aria - an Open Multimodal Native MoE ☆1,033 Updated 3 months ago
- Fully open data curation for reasoning models ☆1,758 Updated this week